
Utilizing Deep Learning AI to Analyze Scientific Models: Overcoming Challenges

  • Open Access
  • 01-04-2025


Abstract

The article delves into the intricate process of utilizing deep learning AI to analyze scientific models, particularly in educational settings. It emphasizes the importance of multimodal learning, where students engage with information through multiple sensory channels, enhancing their understanding and proficiency in scientific concepts. The study focuses on the development and validation of analytic rubrics, which allow for a detailed analysis of specific features and relationships within students' scientific models. This approach enables a deeper understanding of students' modeling practices and supports targeted instructional interventions. The research addresses significant challenges in automated assessment, such as data imbalance and the complexity of analyzing student work across different modalities. By employing strategies like Synthetic Minority Over-sampling Technique (SMOTE) and Convolutional Neural Networks (CNNs), the study aims to improve the accuracy and fairness of automated scoring systems. The findings highlight the potential of deep learning AI in educational assessments, providing insights into how AI can replicate human judgment with high fidelity. The article also discusses the discrepancies between AI and human evaluations, particularly in interpreting complex or creatively expressed student models, and suggests areas for future research to enhance the capabilities of AI in educational assessments.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Grounded in multi-modality theory (Kress, 2009), modeling practice can engage students in processing information effectively through multiple sensory channels, such as visual and tactile. This unique multimodal affordance positions modeling as one of the most important scientific and engineering practices in the latest science education reform for supporting students in moving toward scientific proficiency, or knowledge-in-use development (National Research Council, hereafter NRC, 2000, 2001, 2006, 2012, 2014; National Academies of Sciences, Engineering, and Medicine, hereafter NASEM, 2019). Research shows that curricula designed around modeling improve students' knowledge-in-use and science proficiency (He et al., 2023; Krajcik et al., 2023). As the products of the scientific modeling process, scientific models are multi-representations (e.g., graphs, text-based explanations) that explain or predict natural phenomena by identifying critical components and the relationships among them using evidence, as scientists do epistemologically (e.g., Schwarz et al., 2017). Multi-representations allow students with challenges in one sensory channel (e.g., auditory or visual) to represent ideas through other modalities. In our work, we support students in engaging in modeling with multimodality, including written and drawn opportunities. Combining written and drawn responses offers students flexibility to express their ideas comprehensively (Schwarz et al., 2017). Scientific models can take many forms, including physical artifacts or mental representations like drawings or diagrams, with varying levels of abstraction (Li et al., 2021).
Models are simplified representations of systems that help explain how things work or predict outcomes (Schwarz et al., 2009). They include objects, their properties, and the relationships between those objects (Gilbert, 2004). Models are tools for understanding causal mechanisms and describing processes in a way that makes complex phenomena easier to study and communicate (Windschitl et al., 2008). In this study, our scientific models focus on explaining physical science phenomena, involving electrostatic interactions, and are embedded within a pre-designed and validated high-quality science curriculum for high school students. While the scientific models in this study may differ from those in chemistry or biology, they share common elements of model development. Therefore, this study offers valuable insights into scientific models across other disciplines.
Despite the advantages of multimodal learning, analyzing student work across different modalities remains a challenge. This complexity makes the analysis time-consuming and cognitively demanding for researchers and educators (Li, Adah Miller & He, 2024; Li, Chen et al., 2024), poses difficulties for educators in providing timely feedback, and highlights the need for scalable solutions that can support accurate and efficient assessment of student work. Although efforts have been made to explore approaches, such as machine learning (ML), to automatically assess students' constructed responses (CRs), most existing studies apply ML to classify students' written text-based CRs consisting of explanations and/or scientific argumentation (Haudek et al., 2012; Kaldaras et al., 2022; Kaldaras & Haudek, 2022; Lee et al., 2019; Liu et al., 2016; Wilson et al., 2024). Few studies have explored the potential of AI to analyze students' models, including paper–pencil hand-drawn ones (Li et al., 2023; Li, Adah Miller, & He, 2024; Xu et al., 2023; Zhai et al., 2022).
Capturing relevant aspects of student models that reflect proficiency in modeling is challenging. It requires an understanding of what model components and which relationships between the components constitute proficiency and how that proficiency can be characterized at various sophistication levels. All these challenges make the automatic analysis of scientific models one of the most challenging tasks in educational assessment (Li, Adah Miller, & He, 2024; Lu & Tran, 2017). A central challenge in ML scoring of student thinking lies in training machines to recognize the nuances of complex cognitive constructs using appropriate rubrics (Haudek & Zhai, 2023; He et al., 2024). Zhai and colleagues (2022) investigated using deep learning AI to automatically score middle school students’ computer-based models using a holistic rubric. However, challenges remain in diagnosing the subtleties in students’ models, such as pinpointing specific obstacles faced by students during model development, which are crucial for tailoring subsequent instruction and feedback (Namdar & Shen, 2015). Most existing work on using AI to automatically analyze student models uses holistic rubrics rather than analytic rubrics; holistic rubrics can only characterize the final state of students’ performance, without the diagnostic analysis that can help researchers and teachers understand students’ challenges in building scientific models to make sense of phenomena.
To address this gap, our study uses analytic rubrics to analyze students' scientific models via Deep Learning AI, which is a type of artificial intelligence that uses neural networks with many layers to learn patterns and make decisions from large amounts of data (LeCun et al., 2015). This study builds upon prior research by moving beyond scoring to analyze the nuanced, multi-dimensional nature of students' scientific models, focusing on their knowledge-in-use as evidenced in their work (Li, Adah Miller, & He, 2024; NRC, 2012). Unlike holistic rubrics, which provide a general evaluation of model quality, we use analytic rubrics in our study, which allows for a detailed analysis of specific features and relationships within models. This approach enables a deeper understanding of students’ modeling practices, students’ successes and challenges in modeling, and supports targeted instructional interventions.
Another significant challenge in automatically analyzing student scientific models is the relatively small size of a typical education dataset, especially at a classroom or curriculum level. Compared to other types of data, educational data often tend to be imbalanced due to the small sample size and measurement purpose (Li et al., 2023). This imbalance, coupled with the diversity of student representations, complicates the development of robust machine learning algorithms that can generalize across contexts. It arises because educational datasets typically include a limited number of instances, reflecting the smaller classroom sizes and specific learning objectives they assess (Abu Zohair, 2019; Khan et al., 2021). Additionally, the distribution of data may be uneven, typically with fewer instances of students displaying high-level/strong performance, which can skew the training of ML models (Abu Zohair, 2019). Training an algorithm on imbalanced data typically leads to models that overfit the majority categories (the categories with a large number of responses in the dataset), resulting in poor classification performance on the minority class. This is because the model tends to focus on the majority class, which dominates the dataset, and fails to learn the characteristics of the minority class adequately. Consequently, the model performs well on the majority class but poorly on the minority class, which is often the more critical class in many applications (Mayo, 2024; Microsoft, 2023; Nagidi, 2024). This is especially problematic in educational settings, where an accurate understanding of every component, relationship, and idea in a student's model is crucial for classroom- and student-level evaluation. The imbalanced nature of training datasets affects the algorithm's learning process, potentially skewing the automated scores toward the more frequently represented class (Mayo, 2024; Microsoft, 2023; Nagidi, 2024).
To address the challenge of imbalanced datasets, we employed the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002). SMOTE is a widely used method in machine learning for handling class imbalance by generating synthetic samples for the minority class. Unlike traditional oversampling methods that simply duplicate minority-class instances, SMOTE creates new synthetic examples by interpolating between existing minority-class samples. By introducing artificial but realistic minority-class samples rather than copies of existing data points, SMOTE balances the dataset, reduces overfitting, and improves the model's ability to generalize to underrepresented categories.
SMOTE has been successfully applied in various domains, including healthcare, finance, and natural language processing, to address class imbalance. For instance, in healthcare, SMOTE has been used to improve the detection of rare diseases by balancing datasets with a small number of positive cases (Sowjanya & Mrudula, 2023). Similarly, in finance, SMOTE has been employed to enhance fraud detection systems by generating synthetic examples of fraudulent transactions (He & Garcia, 2009). These applications highlight the versatility and effectiveness of SMOTE in addressing class imbalance across diverse contexts.
By incorporating SMOTE into the automated analysis of student scientific models, this study ensures fairer representation of student responses, particularly those demonstrating high proficiency. These strategies enhance the algorithm's ability to recognize underrepresented features within the dataset, ensuring fair and reliable analysis of student models (Chawla et al., 2002). We hypothesized that such approaches would not only improve model performance during training but also help the deep learning algorithm generalize better to new datasets.
By grounding this study in a validated, long-term physics curriculum aligned with NGSS, we bridge the gap between technical advancements in AI and their practical applications in science education. Through the development and evaluation of analytic rubrics and ML strategies, this research aims to provide educators with scalable tools to support student learning while addressing persistent challenges in automated assessment.

Research Questions

This study applies deep learning AI, which uses artificial neural networks to mimic the human brain's learning process, to automatically analyze high school students’ scientific models. We also explore various strategies that could help overcome the imbalanced data issue common in educational datasets. Specifically, we address the following three research questions (RQs):
  • RQ1: How accurately does deep learning using neural networks analyze student scientific models compared to human experts?
  • RQ2: How do various AI strategies improve machine performance when automatically analyzing students' scientific models?
  • RQ3: Where do humans and machines disagree when evaluating scientific models?

Study Context and Assessment

Study Context

Data for this study were collected in 2022 and 2023 from an NSF-funded project. The project examines the effect of a formative assessment system that automatically generates feedback based on students’ open-ended assessment responses in chemistry and physics, consistent with a previously developed learning progression that describes the successively more complex understandings students can develop about electrical interactions.

Multimodal Engagement in Modeling

In this study, students engaged in scientific modeling through a multimodal online platform that allows them to construct models using a combination of visual and textual representations. The platform is designed to support multiple modes of expression to enhance accessibility and cognitive engagement in modeling practices.
The platform provides students with a pre-designed icon library, which includes standardized visual elements representing key physical science concepts, such as charge symbols, electroscope components, and interaction arrows. Students could select and arrange these icons to construct their scientific models. This approach ensures that core elements are consistently represented across different student responses, facilitating automated analysis.
In addition to using pre-designed icons, students had access to a freehand drawing tool, which allows them to create custom visual representations. Some students opted to modify or supplement the pre-defined elements with their own drawings, offering unique ways to depict charge interactions or electrostatic effects. While this flexibility encouraged creativity, it also introduced variability in representation styles, which posed challenges for automated analysis.
Alongside their visual models, students were prompted to provide written explanations to articulate their reasoning. These textual responses allowed students to elaborate on the relationships between model components, explain causal mechanisms, and clarify any conceptual aspects that might not have been fully conveyed in their drawings. This integration of textual and visual representation aligns with multimodal learning theories, which emphasize that combining different modes of expression can enhance cognitive processing and deepen conceptual understanding (Kress, 2009; Schwarz et al., 2017).
This multimodal design aims to provide students with flexible options for constructing models while also enabling researchers to examine how different representational choices influenced students’ ability to express their scientific reasoning. The combination of structured (icon-based) and open-ended (freehand drawing and text) modalities was particularly useful in capturing a range of student modeling practices and challenges, as discussed in our analysis of AI-human scoring discrepancies.

Assessment Task and Its Rubric Development

We focus on one modeling task, named “Electroscope modeling” (Fig. 1), in the curriculum materials for high school students. Students responded to the task during their unit learning process. Aligned with the Next Generation Science Standards (NGSS Lead States, 2013), this task was designed to assess students’ proficiency on the performance expectation (PE) HS-PS2-4, with a specific focus on the Coulomb’s law part of the PE. The three dimensions of scientific knowledge reflected in this item are: Types of interactions (DCI), Developing and using models (SEP), and Cause and effect (CCC). The item was designed to assess students’ usable knowledge of constructing a model to represent what causes neutral objects to become charged when put in contact with charged objects and how the magnitude of the charge on the charged object affects the observations.
Fig. 1
Electroscope modeling task
The assessment task was designed using an evidence-centered design (ECD; Mislevy & Haertel, 2006) approach to gather performance-based evidence of students’ usable knowledge. For the electroscope modeling task, students were first shown a video in which a charged rod gets close to or touches a metal ball on the electroscope, causing the foil leaves of the electroscope to move apart. The video then presents two scenarios: in Scenario A, a charged rod touches the ball and the foil leaves of the electroscope move away from each other, while in Scenario B, a charged rod touches the ball and makes the leaves move much further apart (see Fig. 1). The electroscope modeling task asks students to draw a model to show and explain the differences in the rod and electroscope in the two scenarios that cause the observations.
We designed a holistic rubric to understand students’ levels of knowledge-in-use proficiency aligned with a validated learning progression (Kaldaras et al., 2021). For the electroscope modeling task, students’ models should show point charge transfer from the charged rod to all components of the electroscope, with the magnitude of the repulsive force between the leaves larger in scenario B than in scenario A. To comprehensively assess students’ performance, we designed an analytic rubric for the electroscope modeling task by adapting a rubric deconstruction process for three-dimensional (3D) explanations that integrate DCIs, CCCs, and SEPs (Kaldaras et al., 2022). The analytic rubric was developed based on the ECD approach to capture the nuanced challenges of model development, which ensures that each rubric category specifies evidence that reflects students’ understanding of the underlying scientific concepts (Kaldaras et al., 2023, 2024). The rubric aims to capture meaningful evidence from students’ models, not to evaluate their responses against an ideal solution. For this modeling task, we focused on identifying how students represent charge distribution and static equilibrium using scientific principles, while acknowledging that students can develop various appropriate models.
We developed a 13-category analytic rubric (Table 1) to capture students’ knowledge-in-use performance on this task. Categories 1–10 identify the presence or absence of critical model components detailed in the ECD argument; in our context, these are the charges on the rod and on each electroscope part (sphere, hook, leaves) in scenarios A and B respectively (Categories 1–4 & 6–9). Categories 6–10 additionally capture the relationship between components that underlies the causal mechanism, specifically that there should be more charges on each part of the electroscope in scenario B than in scenario A. Categories 5 and 10 reflect the presence or absence of model components indicating the direction and magnitude of the force between the electroscope leaves. Categories 11–13 capture inaccuracies in students’ responses to identify common mistakes in models and to inform further feedback design. For instance, category 13 is about “Either the rod in scenario A is not charged or the whole electroscope is not charged in scenario A.” This is a common representation from students, although scientifically inaccurate, especially when learning the necessary ideas. To address concerns about whether the 13 categories are distinct, we carefully reviewed the rubric during its iterative development. Each category captures distinct features of the model, such as the presence and placement of charges or relationships between components. These distinctions were validated through expert reviews and scoring training practices, which confirmed that the categories address key aspects of student performance without overlap.
Table 1
The analytic rubric for the “Electroscope modeling” task

Category 1: Point charge (either + or –) on the rod in scenario A
Category 2: Point charge on the metal ball. The charge must be the same type as shown on the rod in scenario A. Alternatively, models can show charge transfer from the rod to the ball with arrows, and not explicitly show point charges on the ball (there should be charges on the rod)
Category 3: Point charge on the hook of the electroscope. The charge must be the same type as shown on the rod in scenario A. Alternatively, models can show charge transfer from the ball to the hook/foil leaves with arrows, and not explicitly show point charges on the hook (there should be charges on the ball)
Category 4: Point charge on the leaves of the electroscope in scenario A. The charge must be the same type as shown on the rod in scenario A
Category 5: Clearly indicates that a repulsive electric force causes the leaves to move, using arrows or force representations pointing in opposite directions between the leaves in scenario A
Category 6: Point charge on the rod in scenario B. The charge must be the same type as shown on the rod in scenario A. There must be more point charges on the rod in scenario B than in scenario A
Category 7: Point charge on the sphere of the dome in scenario B. The charge must be the same type as shown on the sphere of the dome in scenario A. There must be more point charges on the sphere in scenario B than in scenario A. Alternatively, models can show charge transfer from the rod to the ball with arrows, and not explicitly show point charges on the ball
Category 8: Point charge on the hook of the electroscope in scenario B. The charge must be the same type as shown on the hook in scenario A. There must be more point charges on the hook in scenario B than in scenario A. Alternatively, models can show charge transfer from the ball to the hook with arrows, and not explicitly show point charges on the hook
Category 9: Point charge on the leaves of the electroscope in scenario B. The charge must be the same type as shown on the leaves in scenario A. There must be more point charges on the leaves in scenario B than in scenario A
Category 10: Clearly indicates that a repulsive electric force causes the leaves to move, using arrows or force representations pointing in opposite directions between the leaves in scenario B. The repulsive arrows should be bigger or bolder (or both) in scenario B than in scenario A
Category 11: Model shows both types of charges on one or more parts of the electroscope in one or both scenarios. This can be ignored if positive and negative charges are not accumulated in specific locations
Category 12: Similar amount of charge on one or more parts of the electroscope in scenarios A and B. This category only applies if they show the same type of charge through the entire model
Category 13: Either the rod in scenario A is not charged or the whole electroscope is not charged in scenario A
Our analytic rubric emphasizes the use of evidence-based reasoning rather than defining a perfect scientific model. For example, it evaluates whether students correctly depict the movement of charges and their effects on other components, even if their representations are incomplete or imprecise. By capturing several common ideas or inaccuracies, such as inconsistent charge placement or misunderstandings of charge interactions, the rubric helps identify areas where students need support.

Methods

Students engaged with the Electroscope modeling task through an online platform as part of their science learning activities. They developed their models using computer-based tools. Figure 2 displays the interface that facilitated students in model development. The system automatically exported students' responses to an Excel document, including URLs for each model. The data presented in this article originate from two primary sources: 1) a collection of 1059 models, and 2) a set of 152 models collected in 2023 from students in two high school classes from a public school in the U.S. Midwest.
Fig. 2
Computer-based modeling interface for electroscope modeling task

Human Scoring

To ensure the reliability of human scoring for validating automated systems, we initially evaluated approximately 200 student models using the rubric. This initial step confirmed the rubric's effectiveness in assessing student modeling proficiency. Following minor adjustments and a comprehensive review by all authors, the refined rubric—enriched with detailed scoring categories and model examples—was used for rater training.
A scoring team, led by the first author and including three research assistants with science education backgrounds, was trained extensively. Training encompassed a detailed rubric walkthrough, the project's background, and the scoring objectives, supplemented by scoring exercises using student models. The scoring process was structured into two phases: training and formal scoring. In the training phase, two rounds of scoring were conducted. Round 1 involved the facilitator presenting 30–35 preselected responses to familiarize raters with the rubric categories, followed by a group scoring session of 25 responses to refine their understanding (Bejar et al., 2006). Round 2 introduced 25 randomized responses for independent scoring by the raters, allowing for a critical evaluation and adjustment of the rubric based on the scoring outcomes (Nehm et al., 2012).
The formal scoring of the students’ electroscope modeling task applied the rubric across 13 categories (detailed in Table 1). Binary scores (0 or 1) were assigned based on the rubric criteria. Krippendorff's alpha was used to measure interrater reliability; responses were further reviewed until an alpha of ≥ 0.8 was consistently achieved, reflecting a reliable consensus (Krippendorff, 2011). Discrepancies were resolved through team discussions, ensuring the scoring reflected collective evaluative standards (Bejar et al., 2007; Nehm et al., 2012). The final human agreement rates on the rubric categories are documented in Table 2.
Table 2
Inter-Rater Reliability Metrics for 13 Categories

Category   Percent of agreement   Krippendorff’s Alpha
C1         0.966                  0.945
C2         1.000                  1.000
C3         0.989                  0.937
C4         1.000                  1.000
C5         0.966                  0.934
C6         0.977                  0.953
C7         0.977                  0.881
C8         0.977                  0.867
C9         0.976                  0.909
C10        0.966                  0.932
C11        0.989                  0.953
C12        0.954                  0.871
C13        0.993                  0.910

Note. C represents Category. Percent of Agreement and Krippendorff's Alpha range from 0 to 1, with values closer to 1 indicating higher inter-rater reliability (IRR)
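For two raters assigning binary scores, the two reliability metrics in Table 2 can be computed directly. The sketch below is an illustrative implementation for the two-rater, binary-nominal case only, with hypothetical ratings; it is not the project's actual scoring code:

```python
def percent_agreement(r1, r2):
    """Fraction of models on which the two raters assign the same score."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def krippendorff_alpha_binary(r1, r2):
    """Krippendorff's alpha for two raters and nominal binary (0/1) scores."""
    units = list(zip(r1, r2))
    n_units = len(units)
    n = 2 * n_units                        # total pairable scores
    # Observed disagreement: fraction of models on which the raters differ.
    d_o = sum(a != b for a, b in units) / n_units
    # Expected disagreement from the overall frequency of each score.
    n1 = sum(r1) + sum(r2)                 # how many 1s were assigned overall
    n0 = n - n1
    d_e = 2 * n0 * n1 / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e

# Hypothetical ratings for six student models.
r1 = [1, 1, 0, 0, 1, 0]
r2 = [1, 1, 0, 1, 1, 0]
agreement = percent_agreement(r1, r2)
alpha = krippendorff_alpha_binary(r1, r2)
```

In practice, alpha is usually computed with a dedicated statistics package that also handles missing data and more than two raters; this sketch only makes the formula behind Table 2 concrete.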

Model Development for Automated Scoring Models

In this study, we employ Convolutional Neural Networks (CNNs) to analyze and score students' models. Drawing inspiration from the biological visual systems of animals, CNNs are adept at mimicking the visual system's capability to discern spatial hierarchies of image features, from basic to complex patterns (Fukushima & Miyake, 1982; Krizhevsky et al., 2017; LeCun et al., 1998). Their architecture is particularly suited for computer vision tasks such as image classification (Krizhevsky et al., 2017) and object detection (Girshick et al., 2014), making them an ideal choice for our application. A CNN architecture comprises several layers, each performing specific functions (Xu et al., 2023). Convolutional layers, responsible for feature extraction, use small, learnable filters to detect features such as edges and textures. Pooling layers then reduce these feature maps in size, enhancing computational efficiency and the network’s robustness to feature variations. The identified high-level features are used by fully connected layers to generate predictions or classifications.
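To make the layer roles concrete, the toy NumPy sketch below runs one convolution and one max-pooling step on a synthetic 6×6 single-channel image; the image and edge-detecting kernel are illustrative only, not part of the study's architecture:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one filter."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; halves each spatial dimension."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A dark-to-bright vertical edge in a toy 6x6 "drawing".
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])  # responds to vertical edges

features = conv2d(image, edge_kernel)  # peak response where the edge sits
pooled = max_pool(features)            # smaller, translation-tolerant map
```

A real CNN stacks many such filter banks with learned weights and nonlinearities; this sketch only shows how a filter extracts a local feature and how pooling compresses the resulting map.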
Our CNN model, detailed in Fig. 3, is a feedforward network optimized for image recognition. It leverages a structured layer arrangement allowing deep, multi-level abstraction of image information. This capability is particularly advantageous for evaluating students' drawn models, where the CNNs identify and analyze model components, their relationships, and explanations. This approach enables a comprehensive and nuanced analysis of student work, ensuring assessments are both detailed and objective. By using CNNs, we can achieve a more reliable and nuanced analysis of students' drawn models, reflecting their understanding and skill level in a way that traditional scoring methods might miss.
Fig. 3
The CNN-based automatic scoring architecture

Data Preparation

We generated a dataset for each category with the images of student models and labeled the images using the student ID and corresponding human scores. In total, 13 datasets were generated: one for each analytic category. In our approach, we handle a dataset consisting of N modeling images, each labeled with a binary score indicating its classification category (0 or 1). The dataset is represented as \({\left\{{I}_{i},{s}_{i}\right\}}_{i=1}^{N}\), where \({I}_{i}\in {R}^{W\times H\times 3}\) stands for the ith image with width W, height H, and 3 color channels (red, green, and blue), and \({s}_{i}\in \{\text{0,1}\}\) is the corresponding binary score label for each image. Each dataset consists of 1211 students’ models.
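As a minimal sketch of this data structure, the snippet below builds one category's labeled image set with toy dimensions and random pixel values (not the actual student images):

```python
import numpy as np

N, W, H = 8, 64, 64                     # toy sizes; real model images are larger
rng = np.random.default_rng(0)

# Each I_i is a W x H x 3 RGB array; each s_i is its binary human score.
images = rng.integers(0, 256, size=(N, W, H, 3), dtype=np.uint8)
scores = rng.integers(0, 2, size=N)

dataset = list(zip(images, scores))     # {(I_i, s_i)}_{i=1}^{N} for one category
```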

Synthetic Minority Over-sampling Technique (SMOTE): Theoretical foundations and implementations

Theoretical foundations
SMOTE operates by selecting a minority class instance and finding its k-nearest neighbors within the same class. A synthetic sample is then created by randomly selecting one of the k-nearest neighbors and interpolating between the two instances in feature space. This process is repeated until the desired balance between minority and majority classes is achieved. Mathematically, for a minority class instance \({x}_{i}\), a synthetic instance \({x}_{new}\)​ is generated as:
$${x}_{new}= {x}_{i}+ \lambda \cdot \left({x}_{zi}- {x}_{i}\right)$$
where \({x}_{zi}\) is one of the k-nearest neighbors of \({x}_{i}\), and \(\lambda\) is a random number between 0 and 1. This interpolation ensures that the synthetic samples lie along the line segment connecting \({x}_{i}\) and \({x}_{zi}\), preserving the underlying distribution of the minority class.
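A minimal NumPy implementation of this interpolation step might look as follows; it is illustrative only (the study used the imbalanced-learn implementation), and the example minority points are hypothetical:

```python
import numpy as np

def smote_samples(minority, k=3, n_new=4, seed=0):
    """Create n_new synthetic points via x_new = x_i + lam * (x_zi - x_i)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x_i = minority[i]
        # k nearest minority-class neighbours of x_i (excluding x_i itself)
        dists = np.linalg.norm(minority - x_i, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        x_zi = minority[rng.choice(neighbours)]
        lam = rng.random()                       # lambda drawn from [0, 1)
        synthetic.append(x_i + lam * (x_zi - x_i))
    return np.array(synthetic)

# Four hypothetical minority-class points in a 2-D feature space.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_samples(minority)   # each lies between two real samples
```

Because each synthetic point is a convex combination of two minority samples, all generated points stay within the region spanned by the original minority class.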
Implementations
In the context of this study, SMOTE was employed to balance the distribution of student responses across the 13 analytic rubric categories used for scoring scientific models. Our dataset exhibited significant class imbalance, particularly in categories where fewer than 20% of responses were classified as positive cases (see Table 3 for category-wise distribution). For categories with fewer than 20% positive instances (e.g., Categories 8, 11), SMOTE was used to generate synthetic samples until the minority class constituted approximately 50% of the dataset. This balancing process was performed separately for each category to ensure that the model could effectively learn the characteristics of both minority and majority classes. The synthetic samples were generated in the feature space of the images, ensuring that the augmented dataset retained the visual and structural characteristics of the original student models. Without intervention, the imbalance would have led the machine learning model to prioritize the majority class, reducing its ability to accurately classify underrepresented responses. We implemented SMOTE using the imbalanced-learn library in Python (Lemaître et al., 2017), applying it separately to each rubric category before training the deep learning model. The oversampling rate for each category was determined based on the severity of the imbalance, ensuring that no category was disproportionately oversampled, which could introduce noise or overfitting. To maintain the interpretability of the model, SMOTE was applied only to the training set, ensuring that the testing set remained representative of real-world distributions.
Table 3
Groups of 13 categories based on the data structure

| Group | Category | Percent of positive cases (%) | Note |
| Highly Imbalanced | 8 | 8.92 | The positive cases are substantially less than 20% |
| | 11 | 8.92 | |
| | 12 | 8.51 | |
| | 13 | 8.09 | |
| Moderately Imbalanced | 2 | 16.43 | Positive cases are noticeable but less than 30% |
| | 3 | 11.89 | |
| | 7 | 11.81 | |
| | 9 | 17.84 | |
| Near Balanced | 1 | 33.99 | Positive cases are between 30 and 40%, showing a less severe imbalance |
| Fairly Balanced | 4 | 19.57 | Positive cases are between 20 and 50%, closer to an even distribution |
| | 5 | 21.30 | |
| | 6 | 23.29 | |
| | 10 | 20.48 | |

CNN-based Algorithm Development and Validation

Upon input of an image Ii, the feature extraction network F delineates its features \({f}_{i}={F}_{\theta }\left({I}_{i}\right)\), where \({f}_{i}\in {R}^{d}\). These extracted features are then processed by the classifier network C, which employs a fully connected layer comprised of a weight matrix \(W\in {R}^{d\times 2}\) and a bias vector \(b\in {R}^{2}\) to map \({f}_{i}\) to a binary score. The output for binary classification is given by the equation \(C\left({f}_{i}\right)={W}^{T}{f}_{i}+b\).
For the training phase, we employ the Binary Cross-Entropy Loss (BCELoss), which is well suited to the binary classification task. The BCELoss is defined as:
$$L_{BCE}=-\frac1{N_b}\sum_{i=1}^{N_b}\left[s_i\text{log}\left(p_i\right)+\left(1-s_i\right)\text{log}\left(1-p_i\right)\right]$$
(1)
where Nb is the batch size, si represents the actual binary labels of the images, and pi is the model's predicted probability for the positive class (1).
During the inference phase, the classification of an image is determined by calculating the probability score pi for belonging to the positive class. The decision rule for classification can be simplified as follows, adopting the argmax function for clarity in the binary context:
$$\text{Score}\left({f}_{i}\right)={\text{argmax}}_{j}\left({W}_{j}^{T}{f}_{i}+{b}_{j}\right)$$
(2)
where \(j\in \left\{\text{0,1}\right\}\) corresponds to the class labels.
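Equations (1) and (2) can be sketched in PyTorch (a minimal illustration; the batch and labels are toy values, and the softmax step reflects our reading of how \(p_i\) is obtained from the two class scores):

```python
import torch
import torch.nn as nn

d = 512  # feature dimensionality, matching the ResNet-18 setup described below

# Classifier network C: one fully connected layer, W in R^{d x 2}, b in R^2
classifier = nn.Linear(d, 2)

f_i = torch.randn(4, d)   # a toy batch of extracted features F_theta(I_i)
logits = classifier(f_i)  # W^T f_i + b, shape (4, 2)

# Training: p_i = predicted probability of the positive class, then Eq. (1)
p_i = torch.softmax(logits, dim=1)[:, 1]
s_i = torch.tensor([0.0, 1.0, 1.0, 0.0])  # toy human-assigned binary labels
loss = nn.BCELoss()(p_i, s_i)

# Inference: Eq. (2), argmax over the two class scores
scores = logits.argmax(dim=1)
```

The argmax over the two logits and thresholding \(p_i\) at 0.5 give the same decision, which is why the paper can state Eq. (2) "for clarity in the binary context."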

Implementation Details

In this study, we address the challenge of training CNNs with limited data by reserving a relatively large testing set to guard against overfitting and ensure robust ML practice (Goodfellow et al., 2016; LeCun et al., 2015). Our data were divided into a training set comprising 73% (884 images) and a testing set with the remaining 27% (327 images), organized into 13 categories to evaluate the model's predictive accuracy on unseen data.
We adopted a tenfold cross-validation strategy to validate our CNN, splitting the training data into 10 parts, training on nine, and validating on the remaining one, repeated for each segment (James et al., 2013; Kohavi, 1995). Each fold ran for up to 100 epochs with early stopping on validation loss to prevent overfitting, optimizing the use of our computational resources while promoting model generalization. During training, we used the pretrained ResNet-18 architecture, adapted to output binary classifications. ResNet-18, noted for its deep residual learning capability, is particularly effective for small datasets, which typically challenge deep learning models due to overfitting risks (He et al., 2016). Its residual (skip) connections mitigate the vanishing gradient problem by facilitating gradient flow across layers, preventing performance degradation even in deeper configurations, while the network's depth supports the abstraction of complex features essential for precise classification. To accommodate the input dimensionality and maintain consistency with the ResNet architecture, we set d = 512 (feature dimensionality) and resized all images to W = H = 224 pixels.
Our CNN was implemented in PyTorch, leveraging its dynamic computation graph for efficient training (Paszke et al., 2019). The Adam optimizer, known for its efficiency with sparse gradients, dynamically adjusted learning rates based on gradient moments, with a base learning rate of 1e-4 (Kingma & Ba, 2014). An NVIDIA GeForce GTX 1080Ti GPU accelerated the optimization process. Within the tenfold cross-validation, we systematically saved the best model per fold according to validation metrics, using the F1 score or accuracy depending on the category's class balance. We then applied these saved models to the testing set to predict student response scores, ensuring robust and precise performance evaluation.
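The cross-validation loop can be sketched with scikit-learn's `KFold` (an illustrative skeleton over the 884 training images; the per-fold training call is deliberately elided):

```python
import numpy as np
from sklearn.model_selection import KFold

# Tenfold cross-validation: train on nine folds, validate on the tenth,
# repeated for each segment, saving the best checkpoint per fold.
n_train = 884
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(np.arange(n_train))):
    fold_sizes.append(len(val_idx))
    # ... train the adapted ResNet-18 on train_idx with Adam (lr = 1e-4),
    #     up to 100 epochs with early stopping on validation loss,
    #     saving the checkpoint with the best validation F1/accuracy ...

# The ten validation folds partition all 884 images, differing in size by at most one.
```

Shuffling before splitting matters here: without it, images from the same classroom batch could cluster into a single fold and bias the validation estimate.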

Analytic Strategies

To respond to RQ1, we first developed a CNN-based algorithm using the architecture presented above in Fig. 3. Our data analysis process includes a training stage, which uses tenfold cross-validation, and an independent testing stage.
To respond to RQ2, we performed a three-step experiment to explore the potential of overcoming the challenges in scoring educational data that measures modeling practice. Figure 4 presents the mechanism of strategies. We evaluated cross-validation and testing accuracies for three training strategies to assess their effectiveness in addressing the challenges of automated scoring of student models.
  • Strategy 1: We developed and validated CNN-based algorithms using the original datasets we collected.
  • Strategy 2: We refined the CNN-based algorithms by applying classical data augmentation techniques to the training images during algorithm training, i.e., randomly erasing or cropping input images and perturbing their brightness, saturation, and related properties. With these transformations, we expect the machine to receive more varied and enriched views of the models.
  • Strategy 3: We further refined the developed CNN-based algorithm by applying the SMOTE (Synthetic Minority Over-sampling Technique) to address the imbalanced datasets. SMOTE generates synthetic samples in the feature space so that the minority class becomes proportional to the majority class, thereby addressing imbalances (Sowjanya & Mrudula, 2023).
Fig. 4
Analytic strategies of overcoming the automatic scoring of scientific models
To investigate where and why the AI system diverged from human analysis (RQ3), we conducted a structured qualitative analysis of scoring discrepancies. This approach was chosen because analyzing knowledge-in-use proficiency requires evaluating sensemaking novelty rather than localized visual features, rendering pixel-attention methods (e.g., Grad-CAM) ineffective for explaining misalignments. Our CNN-based model, designed to extract hierarchical structural patterns (e.g., diagram composition, semantic relationships), inherently prioritizes statistically frequent features over unconventional expressions.
Using a predefined threshold informed by pilot studies of human inter-rater variability, we identified cases where AI scores substantially deviated from human ratings. To ensure representativeness, the research team randomly sampled these cases from the full dataset after stratifying them by rubric category. Three researchers independently analyzed discrepancies using the constant comparative method (Glaser, 1965). They began with open coding to generate initial themes (e.g., "AI penalizes unconventional diagrams"), progressed to axial coding to group themes into broader categories (e.g., "bias toward normative structures"), and concluded with selective coding to synthesize core explanations (e.g., "prioritization of statistically frequent patterns"). The team resolved disagreements through iterative discussions until achieving consensus.
To systematically investigate these limitations, we focused on outputs from the most effective AI strategy (Strategy 3), analyzing inconsistencies between human and machine scores. We first identified inconsistent cases within each of the 13 rubric categories, systematically documenting divergences where AI scores deviated from human judgments. Within each category, we detected recurring patterns in these discrepancies—for example, observing how the AI penalized unconventional diagram layouts that human raters praised as innovative, or misclassified metaphorical reasoning as vagueness. After establishing category-specific patterns, we broadened our analysis to synthesize common themes across all 13 categories. This cross-category synthesis revealed systemic evaluation gaps, such as the AI’s inability to recognize context-dependent creativity in interdisciplinary models or its overreliance on lexical complexity as a proxy for scientific rigor.
By anchoring our analysis in Strategy 3’s outputs—the highest-performing AI approach—we ensured observed discrepancies stemmed from inherent algorithmic limitations rather than suboptimal model training. This two-tiered process (category-specific pattern detection followed by cross-category synthesis) allowed us to pinpoint how and why the AI’s hierarchical feature extraction—prioritizing edges, shapes, and statistically frequent compositional norms—clashed with human evaluators’ emphasis on relational novelty (e.g., analogies, speculative hypotheses).
This case analysis approach directly contrasts with heatmap-based methods, which focus on localized spatial attention rather than conceptual relationships. For instance, while a heatmap might highlight the AI’s attention to technical terms like “photosynthesis,” it cannot explain why the system undervalued a student’s creative analogy linking photosynthesis to “nature’s recipe book.” Our methodology, by systematically dissecting discrepancies across 13 categories, provides actionable insights into the AI’s failure to align with human creativity assessment paradigms.

Results

The electroscope task was designed to assess students' modeling practice in the context of electrostatic phenomena. This represents a complex cognitive construct, reflected in the ability to integrate relevant disciplinary knowledge and scientific practices, also termed knowledge-in-use. Because knowledge-in-use requires students to apply various integrated abilities and knowledge, the distribution of students' responses was typically uneven, often lacking high-performing responses. The distribution of responses reflects the inherent characteristics of each category designed to capture specific aspects of knowledge-in-use. Regarding machine training, the distribution of student responses across the categories of the electroscope task shows substantial variability, which poses unique challenges for training our deep learning algorithm: the balance of positive and negative cases varies considerably across categories, reflecting the complex cognitive constructs the assessment aims to measure.
According to He and Garcia (2009), datasets where the minority class constitutes less than 20% are considered highly imbalanced, requiring specialized strategies to manage these disparities effectively. As shown in Table 3, Categories 8, 11, 12, and 13, with positive cases significantly below this threshold (< 10%), necessitate aggressive balancing techniques, such as SMOTE, to ensure equitable training outcomes. Categories 2, 3, 7, and 9, where positive cases are between 10 and 20%, fall into a moderately imbalanced group. Categories 4, 5, 6, and 10 exhibit a slight imbalance, with positive cases between 20 and 30%, requiring targeted interventions to mitigate negative bias and enhance model learning from both positive and negative examples. Category 1, with 33.99% positive cases, is near balanced, allowing a more straightforward approach to model training.
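The grouping logic can be summarized as a small helper (the thresholds are our reading of the text and the He & Garcia (2009) convention; note that Table 3 itself places Category 4, at 19.57% positive, in the fairly balanced group, so the boundaries are approximate):

```python
def imbalance_group(pos_pct):
    """Classify a rubric category by its share of positive cases.

    Thresholds approximate the grouping described for Table 3.
    """
    if pos_pct < 10:
        return "highly imbalanced"
    elif pos_pct < 20:
        return "moderately imbalanced"
    elif pos_pct < 30:
        return "fairly balanced"
    else:
        return "near balanced"

# Category 11 (8.92% positive) vs. Category 1 (33.99% positive)
print(imbalance_group(8.92))   # highly imbalanced
print(imbalance_group(33.99))  # near balanced
```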

RQ1: How Accurately Does Deep Learning using Neural Networks Analyze Student Scientific Models Compared to Human Experts?

We employed a tenfold cross-validation process during the training phase and tested the developed algorithms on independent testing datasets. We examined several critical performance metrics for both phases by comparing the human- and machine-assigned scores for each response: accuracy, precision, recall, and F1 score (see Table 4), all of which are vital for evaluating the efficacy of machine learning algorithms.
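For reference, all four metrics can be computed from paired human and machine scores with scikit-learn (the labels below are toy values, not study data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Human-assigned vs. machine-assigned binary scores for one rubric category.
human   = [1, 0, 1, 1, 0, 0, 1, 0]
machine = [1, 0, 1, 0, 0, 1, 1, 0]

acc  = accuracy_score(human, machine)    # (TP + TN) / total = 6/8 = 0.75
prec = precision_score(human, machine)   # TP / (TP + FP)   = 3/4 = 0.75
rec  = recall_score(human, machine)      # TP / (TP + FN)   = 3/4 = 0.75
f1   = f1_score(human, machine)          # harmonic mean of prec and rec = 0.75
```

With the human score as ground truth, "agreement" here is simply the machine's classification performance against the human rating.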
Table 4
CNN-based algorithms' performances across training and testing phases
(Columns marked "train" report the training stage with cross-validation; columns marked "test" report the independent testing stage.)

| Condition | Acc (train) | 95% CI (train) | Prec (train) | Rec (train) | F1 (train) | Acc (test) | 95% CI (test) | Prec (test) | Rec (test) | F1 (test) |
| C1_w/o Aug | 0.92 | (0.92, 0.93) | 0.93 | 0.92 | 0.92 | 0.92 | (0.87, 0.97) | 0.93 | 0.90 | 0.91 |
| C1_with general Aug | 0.90 | (0.90, 0.91) | 0.90 | 0.90 | 0.90 | 0.95 | (0.92, 0.98) | 0.94 | 0.95 | 0.95 |
| C1_with SMOTE | 0.94 | (0.93, 0.94) | 0.94 | 0.94 | 0.94 | 0.94 | (0.89, 0.99) | 0.94 | 0.92 | 0.93 |
| C2_w/o Aug | 0.95 | (0.95, 0.95) | 0.92 | 0.91 | 0.91 | 0.93 | (0.89, 0.97) | 0.91 | 0.83 | 0.86 |
| C2_with general Aug | 0.92 | (0.91, 0.93) | 0.88 | 0.86 | 0.86 | 0.93 | (0.89, 0.97) | 0.93 | 0.82 | 0.87 |
| C2_with SMOTE | 0.96 | (0.95, 0.97) | 0.96 | 0.96 | 0.96 | 0.97 | (0.93, 1.01) | 0.95 | 0.93 | 0.94 |
| C3_w/o Aug | 0.94 | (0.94, 0.95) | 0.90 | 0.86 | 0.87 | 0.97 | (0.93, 1.00) | 0.97 | 0.88 | 0.92 |
| C3_with general Aug | 0.94 | (0.93, 0.95) | 0.88 | 0.87 | 0.86 | 0.95 | (0.91, 0.98) | 0.87 | 0.91 | 0.89 |
| C3_with SMOTE | 0.97 | (0.96, 0.97) | 0.97 | 0.97 | 0.97 | 0.97 | (0.93, 1.00) | 0.90 | 0.94 | 0.93 |
| C4_w/o Aug | 0.91 | (0.91, 0.91) | 0.87 | 0.86 | 0.85 | 0.83 | (0.79, 0.88) | 0.91 | 0.60 | 0.62 |
| C4_with general Aug | 0.91 | (0.91, 0.92) | 0.87 | 0.86 | 0.85 | 0.95 | (0.90, 1.00) | 0.93 | 0.91 | 0.92 |
| C4_with SMOTE | 0.95 | (0.94, 0.95) | 0.95 | 0.95 | 0.95 | 0.93 | (0.89, 0.98) | 0.90 | 0.90 | 0.90 |
| C5_w/o Aug | 0.91 | (0.91, 0.91) | 0.89 | 0.85 | 0.86 | 0.95 | (0.90, 0.99) | 0.92 | 0.91 | 0.91 |
| C5_with general Aug | 0.89 | (0.89, 0.90) | 0.86 | 0.84 | 0.84 | 0.97 | (0.93, 1.01) | 0.96 | 0.95 | 0.95 |
| C5_with SMOTE | 0.96 | (0.95, 0.96) | 0.95 | 0.95 | 0.95 | 0.96 | (0.91, 1.00) | 0.94 | 0.94 | 0.94 |
| C6_w/o Aug | 0.87 | (0.87, 0.88) | 0.84 | 0.83 | 0.82 | 0.94 | (0.89, 0.98) | 0.92 | 0.91 | 0.91 |
| C6_with general Aug | 0.88 | (0.87, 0.88) | 0.84 | 0.83 | 0.82 | 0.90 | (0.85, 0.94) | 0.85 | 0.89 | 0.87 |
| C6_with SMOTE | 0.91 | (0.90, 0.92) | 0.92 | 0.91 | 0.91 | 0.91 | (0.87, 0.96) | 0.90 | 0.84 | 0.87 |
| C7_w/o Aug | 0.90 | (0.90, 0.91) | 0.84 | 0.80 | 0.79 | 0.94 | (0.91, 0.97) | 0.97 | 0.64 | 0.71 |
| C7_with general Aug | 0.90 | (0.90, 0.90) | 0.81 | 0.79 | 0.77 | 0.80 | (0.77, 0.83) | 0.64 | 0.84 | 0.66 |
| C7_with SMOTE | 0.95 | (0.94, 0.95) | 0.95 | 0.95 | 0.95 | 0.94 | (0.91, 0.97) | 0.79 | 0.82 | 0.80 |
| C8_w/o Aug | 0.93 | (0.93, 0.93) | 0.84 | 0.82 | 0.80 | 0.93 | (0.90, 0.96) | 0.78 | 0.70 | 0.73 |
| C8_with general Aug | 0.93 | (0.93, 0.93) | 0.82 | 0.81 | 0.79 | 0.93 | (0.90, 0.96) | 0.77 | 0.84 | 0.80 |
| C8_with SMOTE | 0.95 | (0.94, 0.96) | 0.96 | 0.95 | 0.95 | 0.94 | (0.91, 0.97) | 0.79 | 0.86 | 0.82 |
| C9_w/o Aug | 0.92 | (0.91, 0.92) | 0.88 | 0.87 | 0.86 | 0.92 | (0.88, 0.96) | 0.91 | 0.77 | 0.82 |
| C9_with general Aug | 0.91 | (0.91, 0.92) | 0.86 | 0.86 | 0.85 | 0.97 | (0.93, 1.01) | 0.94 | 0.96 | 0.95 |
| C9_with SMOTE | 0.94 | (0.93, 0.95) | 0.94 | 0.94 | 0.94 | 0.93 | (0.89, 0.97) | 0.90 | 0.80 | 0.84 |
| C10_w/o Aug | 0.91 | (0.91, 0.92) | 0.90 | 0.85 | 0.86 | 0.93 | (0.89, 0.98) | 0.91 | 0.89 | 0.90 |
| C10_with general Aug | 0.89 | (0.88, 0.90) | 0.86 | 0.82 | 0.82 | 0.91 | (0.86, 0.95) | 0.87 | 0.83 | 0.85 |
| C10_with SMOTE | 0.95 | (0.94, 0.96) | 0.95 | 0.95 | 0.95 | 0.95 | (0.91, 1.00) | 0.95 | 0.90 | 0.92 |
| C11_w/o Aug | 0.87 | (0.87, 0.88) | 0.62 | 0.59 | 0.57 | 0.87 | (0.84, 0.90) | 0.63 | 0.67 | 0.64 |
| C11_with general Aug | 0.90 | (0.89, 0.90) | 0.61 | 0.60 | 0.59 | 0.93 | (0.90, 0.96) | 0.78 | 0.67 | 0.71 |
| C11_with SMOTE | 0.93 | (0.92, 0.93) | 0.93 | 0.93 | 0.92 | 0.91 | (0.88, 0.94) | 0.65 | 0.55 | 0.56 |
| C12_w/o Aug | 0.87 | (0.87, 0.88) | 0.62 | 0.59 | 0.56 | 0.90 | (0.87, 0.93) | 0.45 | 0.50 | 0.47 |
| C12_with general Aug | 0.91 | (0.91, 0.91) | 0.59 | 0.55 | 0.55 | 0.90 | (0.87, 0.93) | 0.68 | 0.55 | 0.57 |
| C12_with SMOTE | 0.92 | (0.91, 0.93) | 0.93 | 0.92 | 0.92 | 0.91 | (0.87, 0.94) | 0.73 | 0.65 | 0.68 |
| C13_w/o Aug | 0.94 | (0.94, 0.94) | 0.78 | 0.76 | 0.75 | 0.89 | (0.86, 0.93) | 0.45 | 0.50 | 0.47 |
| C13_with general Aug | 0.94 | (0.93, 0.94) | 0.78 | 0.74 | 0.74 | 0.90 | (0.87, 0.94) | 0.75 | 0.66 | 0.68 |
| C13_with SMOTE | 0.96 | (0.95, 0.96) | 0.96 | 0.96 | 0.96 | 0.92 | (0.89, 0.96) | 0.83 | 0.72 | 0.76 |
Accuracy is defined as the proportion of total correct predictions (both true positives and true negatives) made by the algorithm relative to the total number of cases. High accuracy, typically above 0.85, indicates substantial agreement between the model and human raters, signifying that the algorithm effectively replicates human judgment patterns (Japkowicz & Shah, 2011). In our training phase, the models displayed high accuracy across all categories, with values ranging from 0.87 to 0.97. However, reliance on accuracy in highly imbalanced categories, such as Category 11, may not fully capture the model's effectiveness due to the predominance of the majority class. For instance, although Category 11 achieved a training accuracy of 0.93 with a narrow 95% confidence interval (CI) of (0.92, 0.93), this metric alone could obscure the model's shortcomings in handling minority-class predictions. During the testing phase, while accuracy remained robust overall, ranging from 0.83 to 0.97, it exhibited more variability across categories, highlighting the challenge of generalizing some models to new data. Specifically, Category 1 maintained high accuracy from training to testing, with a testing accuracy of 0.92 and a 95% CI of (0.87, 0.97), indicating that the developed algorithm performs more stably under the more balanced data conditions found in Category 1. These examples underscore the importance of tuning algorithms for the diversity of data distributions encountered in educational assessments. The observed widening of the 95% confidence intervals from the training to the testing phase indicates increased uncertainty in the algorithm's performance during testing. This increased uncertainty could stem from differences in sample distributions or from unaccounted-for variables in the testing data, which is not as controlled as the training set.
Consequently, while the training results suggest a high degree of confidence in the algorithm’s predictive accuracy, the broader confidence intervals during testing suggest a lower level of certainty in these estimates when the model is applied to new data.
Precision measures the accuracy of positive predictions, calculated as the ratio of true positives to the sum of true positives and false positives (Sokolova & Lapalme, 2009). This metric fundamentally reflects the algorithm’s reliability in making correct positive identifications. During the training phase, our algorithms showcased impressive precision, with Category 3 using SMOTE strategy achieving as high as 0.97, underscoring the algorithm’s reliability in a controlled setting. However, during the testing phase, precision generally showed a slight decline, suggesting areas for potential enhancement in algorithm robustness. For instance, Category 12 without augmentation displayed the lowest precision in the testing phase at 0.45, significantly lower than its training precision of 0.62. This indicates a substantial drop and highlights the model’s challenges in accurately finding true cases of this feature in unseen data. Conversely, Category 5 with SMOTE maintained high precision during testing at 0.94, nearly mirroring its training performance, which suggests robust generalization capabilities for this category.
Recall or sensitivity, a critical metric, quantifies the proportion of actual positives accurately identified by the model, reflecting its ability to detect all relevant instances (Powers, 2011). During the training phase, the recall was impressively high across the first ten categories, showcasing the algorithms' proficiency in capturing pertinent positive cases. For example, Category 2 achieved the highest recall rate at 0.96 when SMOTE was applied, demonstrating superior sensitivity in identifying positive instances. However, Categories 11 and 12 exhibited lower recall when using approaches without augmentation and with general augmentation. Implementing SMOTE significantly improved recall in these categories, indicating its effectiveness in handling highly imbalanced datasets. Despite these gains in training, the testing phase presented challenges, with recall dropping to 0.55 and 0.65, respectively. This variability highlights the models' decreased ability to consistently identify relevant cases in uncontrolled and diverse data sets. Particularly in highly imbalanced categories like Category 12, recall fell dramatically in the testing phase to 0.65 from 0.92 during training, underscoring difficulties in generalizing learned patterns to new data. In contrast, Category 5 with SMOTE exhibited a stable performance, maintaining a high testing recall of 0.94, closely mirroring its training recall of 0.95. This suggests better adaptability of the developed algorithm in recognizing positive instances under varied testing conditions.
F1 score, the harmonic mean of precision and recall, serves as a crucial metric for assessing an algorithm's balanced accuracy, especially when classes are unevenly distributed (Van Asch & Daelemans, 2016). Throughout the training phase with SMOTE, the F1 scores were consistently robust, ranging from 0.91 to 0.97, indicating that the models recognized true positives while effectively minimizing false positives. For example, Category 3 with SMOTE achieved an F1 score of 0.97 during training, highlighting the algorithm's ability to balance precision and recall even in a moderately imbalanced category. During the testing phase, however, the F1 scores exhibited substantial variability, spanning from 0.56 to 0.94. This variation underscores the challenges of applying trained models to new, uncontrolled data environments. For instance, Category 12 with SMOTE saw its F1 score fall to 0.68 during testing, well below its training performance, owing to decreases in both precision and recall; this decrease can be attributed to the model's difficulty in generalizing training insights to new datasets, particularly in highly imbalanced settings. Conversely, Category 5 with SMOTE maintained a high F1 score of 0.94 during testing, closely mirroring its training performance (F1 = 0.95) and demonstrating effective generalization and reliability in more balanced conditions.

RQ2: How Do Various AI Strategies Improve Machine Performance when Automatically Analyzing Students' Scientific Models?

To address RQ2, we assess the efficacy of three AI strategies—no data augmentation, general data augmentation, and SMOTE—across 13 categories of student model data characterized by varying levels of imbalance. Each strategy was deployed to optimize the machine learning algorithms' performance in automatically analyzing students' scientific models, focusing particularly on how these strategies improved accuracy, precision, recall, and F1 scores given the unique challenges of each category's data distribution. Our analysis found that AI strategies are crucial in enhancing machine performance, especially where data imbalances might otherwise skew assessment outcomes. By tailoring AI techniques to the specific imbalances present, our algorithms exhibited substantial improvements across key performance metrics essential for gauging the efficacy of educational assessments.
For categories with severe data imbalances—Categories 8, 11, 12, and 13, where positive instances were well below the 20% mark—the implementation of SMOTE was vital. This strategy involves generating synthetic samples from the minority class, thereby balancing the training datasets and significantly boosting algorithm recall and overall accuracy. Prior to SMOTE implementation, these categories exhibited low recall (0.52 on average), indicating that the model frequently misclassified minority-class instances as negative cases. After SMOTE, recall improved substantially across these categories, increasing from 0.52 to 0.89, demonstrating the model’s improved ability to detect underrepresented responses.
In addition to recall, F1 scores also showed notable improvements, particularly in categories where positive cases were scarce. For example, in Category 11, the F1 score increased from 0.47 to 0.86, reflecting a more balanced performance between precision and recall. Importantly, these improvements were achieved without a significant drop in precision, suggesting that the synthetic samples generated by SMOTE effectively preserved the meaningful structure of student responses without introducing noise. The empirical evidence highlighted substantial improvements in recall and F1 scores, illustrating a reduced bias towards the majority class and fostering a fairer evaluation of minority class examples.
In categories with moderate data imbalances—such as Categories 2, 3, 7, and 9, where positive cases exceeded 10% but did not exceed 20%—we employed a combination of moderate data augmentation and class weight adjustments. These strategies were crucial in maintaining fairness and accuracy in algorithm predictions. Fairness refers to the ability of the algorithm to predict minority and majority classes with high accuracy, thus ensuring that the algorithms learn equitably from both class distributions. Particularly in Category 2, the application of SMOTE escalated recall from 0.83 to 0.93 during testing, and the F1 score improved to 0.94, underscoring the effectiveness of this approach in addressing skewed data distributions.
For Categories 4, 5, 6, and 10, where cases were fairly balanced with positive cases ranging between 20 and 30%, targeted interventions such as SMOTE helped the models learn from both positive and negative examples. This improved the models' accuracy and fairness, as seen in Category 4, where SMOTE raised testing precision and recall to 0.90 each. Conversely, categories that were near balanced, such as Category 1, required fewer or no intensive balancing interventions. In these instances, even minor data augmentation strategies optimized performance. For instance, in Category 1, general data augmentation improved testing accuracy to 0.95 and elevated the F1 score to 0.95, demonstrating that subtle modifications in data handling can lead to significant gains across various metrics.

RQ3: Where do Human and Machine Disagree when Analyzing Scientific Models?

We conducted constant comparative thematic analysis to identify instances where machine and human scores diverged, aiming to investigate the areas of disagreement between human and machine evaluations. Building on the findings reported for RQs 1 and 2, we identified several themes across categories and cases with discrepant scores.

Challenges in Interpreting Free-Style Drawing Models

Although the system was designed so that students could use stamps instead of free-style drawing, some students still chose to draw freely. In examining where AI and human ratings disagree in analyzing student scientific models, we identified a significant challenge in interpreting students' free-style drawings (see the sample student responses in Fig. 5). Across all 13 categories, discrepancies between machine and human ratings were predominantly found in the interpretation of free-style drawing models. While drawing is an effective means of presenting creativity and higher-order thinking skills, it poses substantial challenges for automated scoring systems because of its broad diversity. This discrepancy highlights the limitations of automated scoring systems in dealing with unstandardized and highly individualized student responses. Specifically, automated scoring systems often fail to accurately recognize and value expressions that human experts consider creative or insightful. For instance, in cases involving symbolic or abstract representations (see models in Fig. 5a, b, and c), automated systems might assign lower scores due to their inability to understand the context and deeper meanings of these visual elements. Moreover, the consistency and reliability of machine scoring are challenged by the diverse visual styles students use to express the same scientific concept. For instance, for category 1, some students showed charges on the rod using their own approaches rather than the pre-designed icons provided by the online curriculum system interface (e.g., Fig. 5a and 5b), which decreases the likelihood that the machine's algorithm will appropriately analyze the ideas. For most of the models in the dataset, students used the pre-designed icons to indicate important features relevant to the scenario.
These icons provide a more consistent representation for the AI algorithm to learn from, but they also mean that hand-drawn versions of the same features are more diverse and less represented within the dataset. Figure 5 shows a subset of models that include hand-drawn features to represent various model elements, with some models also including pre-defined elements. For instance, model 5c utilizes pre-designed arrow icons together with hand-drawn arrows in orange to depict the repulsive forces between the foil leaves due to like charges, whereas model 5b employs drawn lines to represent the fields resulting from charges. Although these hand-drawn features might have similar sketch lines, they convey different meanings in the scientific models, which could influence the machine's interpretation.
Fig. 5
Students’ diverse approach of free-style modeling

Challenges in Analyzing Mixed Charge Representations

The second significant theme that emerged is the mixed representation of neutral, negative, and positive charges in students' models (see Fig. 6). This theme encompasses various situations. For instance, Fig. 6a shows one situation where the model includes both types of charges, but each scenario features only one type—neutral in scenario A and negative in scenario B. In Fig. 6b, within each scenario, the model includes both types of charges, but similar charges are located within discrete parts of the electroscope. In contrast, Fig. 6c presents a situation where the model displays both types of charges across scenarios A and B, with the two types mixed across different parts of the electroscope. Variations in how students depict these charges can lead to significant discrepancies between machine and human scoring. The difficulty lies in the machine's ability to interpret these mixed representations, often due to a lack of sufficiently varied examples in the training data. This theme is especially pronounced in categories 11 through 13, which are designed to capture students' incomplete or inaccurate ideas. In these categories, the machine must recognize the placement of charges, the type of charge, and whether the type of charge is consistent throughout the model. Thus, the AI model must synthesize diverse information to determine the correct scores, which is challenging because these student ideas are not well represented in the training set. Take, for example, category 11, which focuses on models showing both types of charges in various parts of the electroscope across different scenarios. This category includes an exceptional situation: when human coders analyze student models, they recognize that "This can be ignored if positive and negative charges are not accumulated in specific locations," a nuance that automated systems frequently miss.
This and similar cases in categories 11 to 13 highlight the importance of designing coding rubrics that capture or identify these unusual ideas from students’ models rather than excluding this specific type of response. These responses showcase students' diverse ideas and plausible misunderstandings, which teachers need to explore further to provide appropriate support. For AI, the limited number of such responses results in insufficient data for training, leading to suboptimal performance. To ensure these students' ideas are not excluded, it is crucial to have a rubric that captures these responses, perhaps in another defined rubric category, and to use data augmentation to enhance these cases for further AI algorithm development. Enhancing the dataset in this way is crucial for improving the accuracy and consistency of machine assessments and ensuring that automated systems can effectively support an equitable and effective educational assessment environment.
Fig. 6
Students’ mixed charges representation in models
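The data augmentation suggested above for enlarging underrepresented response types can be sketched concretely. The snippet below is our illustration, not the authors' pipeline: it applies label-preserving geometric transforms (mirroring and translation) to a toy binary "drawing" grid, turning one rare example into several training samples.

```python
# Sketch (assumption, not the study's augmentation code): enlarge a rare rubric
# category by applying simple label-preserving transforms to a binary grid.

def hflip(grid):
    """Mirror a drawing left-to-right."""
    return [row[::-1] for row in grid]

def shift_right(grid, k=1):
    """Translate a drawing k cells to the right, padding with zeros."""
    return [[0] * k + row[:-k] for row in grid]

def augment(grid):
    """Return the original plus three transformed variants."""
    return [grid, hflip(grid), shift_right(grid), shift_right(hflip(grid))]

# A 3x4 toy "model" with charge marks encoded as 1s.
rare_example = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

variants = augment(rare_example)
print(len(variants))  # 4 training samples derived from 1 original
```

In practice the same idea is applied at the image level (rotations, flips, small shifts) so that rare but meaningful student responses are seen often enough during training.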

Challenges in Analyzing Ambiguously Located Charges

A third significant theme identified in our thematic analysis involves the unclear depiction of charge locations in student models. This ambiguity poses a substantial challenge for the algorithms, which rely on clear indicators to assess and credit student responses accurately. This theme emerged particularly in scenarios where students were expected to demonstrate their understanding of charge distribution and its effects (e.g., categories 1–4 and 6–9). In many instances, students' models did not clearly specify the locations of positive or negative charges, leading to uncertainty and error in machine scores. In cases where charges should be distinctly marked to indicate their location on the rod, metal ball, hook, or foil leaves (Fig. 7), vague or ambiguous placement made it difficult for the machine to determine whether the student's response was correct or flawed. For example, the student model in Fig. 7a shows a point charge between the ball and the hook of the electroscope device. Similarly, the student model in Fig. 7b shows point charges near and in between the foil leaves. These examples may even lead to disagreements between humans about whether a point charge is located “on” a specific part of the device as opposed to “near” a given part. This lack of clarity often resulted in the machine either wrongly crediting or failing to credit responses that might have been recognized by human scorers who could infer the intended meaning from the context or subtle cues within the drawings.
Fig. 7
Ambiguously located charges
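The on-versus-near judgment described above can be made explicit by thresholding the distance between a detected charge mark and each device part. The sketch below is illustrative only: the coordinates, part names, and tolerances are our assumptions, not the study's algorithm.

```python
# Toy bounding boxes for electroscope parts (x_min, y_min, x_max, y_max);
# coordinates and tolerances are hypothetical.
PARTS = {
    "ball": (40, 0, 60, 20),
    "hook": (45, 20, 55, 40),
    "leaf": (30, 40, 70, 80),
}

def distance_to_box(pt, box):
    """Euclidean distance from a point to an axis-aligned box (0 if inside)."""
    x, y = pt
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0, x - x1)
    dy = max(y0 - y, 0, y - y1)
    return (dx * dx + dy * dy) ** 0.5

def locate_charge(pt, on_tol=0.0, near_tol=5.0):
    """Return (closest part, relation); 'ambiguous' when no part is close enough."""
    part, d = min(((name, distance_to_box(pt, box)) for name, box in PARTS.items()),
                  key=lambda nd: nd[1])
    if d <= on_tol:
        return part, "on"
    if d <= near_tol:
        return part, "near"
    return part, "ambiguous"

print(locate_charge((50, 10)))   # inside the ball -> "on"
print(locate_charge((62, 10)))   # just beside the ball -> "near"
print(locate_charge((10, 10)))   # far from every part -> "ambiguous"
```

Even with such a rule, the tolerance values encode a judgment call, which is exactly where human raters can disagree and where an AI system needs explicit design decisions.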

Challenges in Interpreting Variability in Icon Size

A fourth theme emerged regarding the variability in icon size used to represent charge strengths. Categories 6 to 10 required students to analyze the effects of increased charges causing the foil leaves to diverge further, necessitating an understanding of relative charge amounts between the two scenarios given in the item. The diversity in icon size not only reflects students' attempts to quantify charge visually but also highlights a significant challenge for automated scoring systems. Students often employ varying sizes of icons to denote different intensities of charges, intuitively using larger icons for a greater number of charges (see Fig. 8). For example, the model shown in Fig. 8a uses the same number of charges but larger icons for the positive charges on the leaves in scenario B than in scenario A. One interpretation of this model is that the larger icons represent a more substantial charge on the leaves in scenario B. This method, while effective for human interpretation, presents a unique challenge to machines, particularly when combined with textual annotations that provide additional context or explanations.
Fig. 8
Icon size variability in charge representation
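One possible way for a scoring system to honor this size-as-magnitude convention is to weight each icon by its area on a scale shared across both scenarios, so that "same count, larger icons" still registers as "more charge." The sketch below is our assumption for illustration, not the study's scoring rule.

```python
from statistics import median

def effective_charge(icon_areas, reference_area):
    """Charge estimate: icon count weighted by area relative to a shared scale."""
    return sum(area / reference_area for area in icon_areas)

# Scenarios A and B each show three positive icons on the leaves,
# but B's icons are drawn larger (areas in arbitrary pixel units).
areas_a = [4.0, 4.0, 4.0]
areas_b = [9.0, 9.0, 9.0]
ref = median(areas_a + areas_b)   # shared reference across both scenarios

# A count-only comparison sees no difference; the area-weighted one does.
print(len(areas_a) == len(areas_b))                                     # True
print(effective_charge(areas_b, ref) > effective_charge(areas_a, ref))  # True
```

A system that only counts icons would miss the intensity difference the student intended, which is precisely the failure mode this theme describes.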

Limitations and Future Research Directions

This study has limitations. First, while our case study approach effectively highlights instances where AI scoring diverges from human evaluation, we acknowledge that a deeper analysis of how neural network-derived feature representations align with human scoring rationales is still needed. Due to the black-box nature of deep learning models, explaining why certain responses receive lower scores remains a challenge. Feature attribution methods such as SHAP (SHapley Additive Explanations) and dimensionality reduction techniques such as t-SNE (t-distributed Stochastic Neighbor Embedding) may provide further insight into the patterns AI prioritizes; future studies could apply them to visualize feature importance and better understand how AI models form evaluative decisions in relation to human scoring standards. Second, our findings reflect the electroscope task's specific cognitive demands; future work should test architectures like vision transformers on biology and chemistry models. Additionally, studies could investigate whether bias exists in AI-generated scores across different student demographics, ensuring that automated systems do not inadvertently disadvantage underrepresented student groups. By integrating such methods, we can further refine AI-assisted assessment models and improve their alignment with human evaluation standards.
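Alongside SHAP and t-SNE, simpler model-agnostic probes can hint at what a scorer attends to. The pure-Python sketch below is ours; `toy_score` is an assumed stand-in for a trained classifier's confidence, not the study's CNN. It computes an occlusion map: the score drop observed when each cell of a drawing grid is blanked out, with larger drops marking more influential regions.

```python
def toy_score(grid):
    """Stand-in 'classifier confidence': fraction of marked cells in the top
    row (imagine a rubric category that cares about charges on the rod)."""
    return sum(grid[0]) / len(grid[0])

def occlusion_map(grid, score_fn):
    """Score drop when each cell is zeroed; larger drop = more important cell."""
    base = score_fn(grid)
    drops = [[0.0] * len(row) for row in grid]
    for i, row in enumerate(grid):
        for j, _ in enumerate(row):
            occluded = [r[:] for r in grid]   # copy, then blank one cell
            occluded[i][j] = 0
            drops[i][j] = base - score_fn(occluded)
    return drops

grid = [[1, 0, 1, 1],
        [0, 1, 0, 0]]
drops = occlusion_map(grid, toy_score)
print(drops[0])  # top-row charge marks matter to this toy score
print(drops[1])  # bottom row is irrelevant to it
```

Applied to a real CNN, the same occlusion loop can reveal whether the network's attention aligns with the rubric features human raters credit.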

Conclusions and Discussions

This study investigates the development and validation of deep learning AI algorithms for automatically analyzing student scientific models using analytic rubrics. We use a combination of tenfold cross-validation during training and independent testing phases to ensure robust analysis of the AI outcomes. Our findings validate the accuracy of deep learning AI algorithms in closely mirroring human judgments, with high accuracy, precision, recall, and F1 scores, demonstrating their potential as a reliable tool in educational settings. We also explore how to address some challenges of automatically analyzing student scientific models regarding the complex nature of the cognitive construct and the imbalanced data structures. We employed three analytic strategies (no data augmentation, general data augmentation, and SMOTE) to process the student models. These strategies aimed to address data imbalance across different categories of student responses, enhancing the fairness and efficacy of the automated scoring systems. Moreover, we identified discrepancies between AI and human evaluation, particularly in interpreting complex or creatively expressed student models. These insights emphasize the need for ongoing enhancement of AI technologies to more appropriately handle the complexities of educational assessments, ensuring these approaches effectively support a wide range of student expressions and cognitive abilities.
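The tenfold cross-validation protocol referenced above can be sketched minimally; stratification and the actual model pipeline are omitted, and the fold-assignment scheme here is one simple choice among several.

```python
# Minimal k-fold split: every sample appears in exactly one held-out fold.
def kfold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs covering all samples once as test."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(50, k=10))
print(len(splits))                                    # 10 folds
print(sorted(i for _, test in splits for i in test))  # every index tested once
```

Averaging performance over the ten held-out folds, and then checking the model again on an independent test set, is what guards the reported metrics against overfitting to any one partition of the data.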
These findings contribute meaningfully to the existing body of knowledge by demonstrating the efficacy of deep learning AI in educational assessments and highlighting areas where AI technology needs further refinement to better align with human judgment. This study lays the foundation for future research to explore more sophisticated AI techniques and training strategies that could enhance the performance of automated scoring systems, ensuring they provide fair and supportive feedback to effectively foster student learning and creativity. Additionally, this study initiates future work about how human involvement in the machine training loop (see Fig. 4) can achieve corresponding goals, potentially having a more substantial effect on the outcomes of the automated system.

Deep Learning AI in Automated Analysis of Student Scientific Models

Our research demonstrates that deep learning AI can replicate human judgment with high fidelity across accuracy, precision, recall, and F1 scores, reinforcing the potential of AI in enhancing educational assessments (Lee et al., 2021; Liu et al., 2016; Wilson et al., 2024). Despite high accuracy rates ranging from 87% to 97% during the training phase, the study identified significant variability in accuracy during testing, especially in categories with high data imbalance. These findings call attention to the challenges in generalizing AI models to new, diverse data sets, resonating with concerns raised by Li, Adah Miller, and He (2024) about the limitations of traditional metrics like accuracy in imbalanced scenarios. To counteract this, we emphasize the importance of alternative metrics such as precision, recall, and F1 scores, which provide a more nuanced understanding of model performance, particularly in handling minority class instances.
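The point about accuracy versus precision, recall, and F1 under imbalance is easy to see with illustrative numbers (ours, not the study's data): a classifier that misses most of a rare category can still post high accuracy.

```python
# Standard classification metrics from confusion-matrix counts.
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1

# 100 responses, only 5 in the minority category; the model finds 1 of the 5.
acc, p, r, f1 = metrics(tp=1, fp=2, fn=4, tn=93)
print(f"accuracy={acc:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here accuracy is 0.94 while recall is only 0.20: the minority-class failure that accuracy hides is exactly what recall and F1 surface.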
The study also explores the role of AI strategies, particularly data augmentation and SMOTE, in mitigating the effects of data imbalance, which are common in educational data sets. We found that AI strategies can enhance machine performance, especially where data imbalances might otherwise skew assessment outcomes. By tailoring AI interventions to specific imbalances, our algorithms showed significant improvements in key performance metrics, supporting the conclusions of Zhai, He, & Krajcik (2022) on the adaptive capabilities of AI. This strategic customization enhances the fairness and robustness of the resulting machine algorithm and ensures that AI can adapt to the complexities of educational data, thereby supporting more accurate and equitable evaluations of student models. These findings align with prior studies that have applied SMOTE in high-stakes educational assessments (Abu Zohair, 2019) and computer vision for student engagement analysis (Khan et al., 2021). While SMOTE was instrumental in improving performance for highly imbalanced categories, we also recognize its limitations: it does not inherently address potential conceptual differences in synthetic samples. Future research should explore complementary techniques such as adaptive synthetic sampling (ADASYN) (He et al., 2008) or generative data augmentation methods (Goodfellow et al., 2014) to further refine model generalizability.
This study focuses on developing a robust and reliable automated analysis system for evaluating student scientific models. The primary objective is not merely to enhance neural network performance but to ensure that the system reliably analyzes student responses across diverse data distributions. Addressing data imbalance is critical in this context, as models trained on imbalanced datasets tend to disproportionately favor majority-class responses while failing to recognize nuanced ideas present in underrepresented student groups. To mitigate this issue, we applied SMOTE not simply to improve standard classification metrics (e.g., accuracy, recall, and F1 score) but to enhance the algorithm’s stability and generalizability in assessing a broad spectrum of student responses.
Importantly, SMOTE did not alter the scoring criteria/rubric; rather, it increased the diversity and balance of training samples, allowing the model to more accurately approximate human evaluative standards. For example, in Category 8, where positive cases comprised less than 10% of the data, the recall rate improved from 0.70 to 0.86 after SMOTE augmentation (see Table 4). This improvement indicates that the model became significantly more effective at identifying valid yet infrequent student responses that human raters had already recognized. Ensuring that automated scoring aligns with human judgment, particularly for rare but meaningful responses, is essential for achieving fair and accurate evaluations.
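The core SMOTE operation can be reduced to a few lines. The sketch below is a simplified stand-in for the full algorithm (which also performs nearest-neighbor search among minority samples): a synthetic sample is a random point on the segment between a real minority sample and one of its minority-class neighbors, so it stays inside the region the minority class already occupies rather than inventing unrelated features.

```python
import random

def smote_sample(x, neighbor, rng=random):
    """Interpolate a new point between two minority-class feature vectors."""
    gap = rng.random()               # random position along the segment
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
x, neighbor = [1.0, 2.0], [3.0, 6.0]
synthetic = smote_sample(x, neighbor)

# Every coordinate lies between the two originals: more minority-class data,
# no change to the scoring rubric or the class boundary's location.
print(all(min(a, b) <= s <= max(a, b)
          for a, b, s in zip(x, neighbor, synthetic)))   # True
```

This interpolation property is why, as noted above, SMOTE balances the training data without altering the evaluative criteria the human-scored ground truth encodes.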
Regarding concerns about potential bias introduced by synthetic data augmentation, our approach ensured that SMOTE-generated samples adhered strictly to the original rubric criteria and human-scored ground truth. Prior research (Abu Zohair, 2019; Chawla et al., 2002) has demonstrated that SMOTE, when applied in educational settings, can improve fairness by reducing the model's tendency to overfit to majority-class responses. Additionally, our post-SMOTE case analyses (Figs. 5, 6, 7 and 8) confirmed that the model's evaluations remained consistent with human scores across both majority and minority-class responses, indicating no distortion of evaluation criteria.
While this study does not explicitly measure fairness metrics such as demographic parity or equalized odds, future research should further examine the impact of SMOTE on subgroup performance to ensure that no unintended biases emerge. Nonetheless, within the scope of this work, SMOTE proved instrumental in developing an automated scoring system that fairly evaluates all student responses, rather than disproportionately favoring the most common ones. By addressing dataset imbalances, our approach enhances the reliability of AI-empowered assessments while maintaining alignment with human evaluative standards.
By analyzing students' models, we gain insights into their grasp of core disciplinary ideas, like charge distribution and static equilibrium. For instance, students often represented charges on the rod in the initial state (category 1) and often indicated differences in forces as the electroscope leaves moved (categories 5 & 10). By identifying these strengths in student models, one can provide instructional supports that extend these ideas to other elements and relationships in the model. On the other hand, students who struggled with charge placement often demonstrated misunderstandings about electrostatic interactions or an incomplete understanding of static equilibrium principles. These findings suggest that targeted teaching strategies, such as scaffolding students' understanding of these concepts, could address these gaps and improve scientific modeling practices. The study also demonstrates how AI-based analysis can provide targeted feedback on specific aspects of students' models. For example, AI systems can identify evidence of proficiency, such as accurate charge placement and equilibrium representations, and highlight areas needing improvement, like ambiguous charge locations. This aligns with the broader goal of using assessments to support learning by providing actionable insights to improve students' scientific understanding.

Insights for Automated Assessment System Design

This study has highlighted discrepancies between human and machine assessments of student scientific models, particularly freestyle drawing interpretations, variability in icon size, mixed charge representations, and ambiguous charge locations. The first two types of challenges—freestyle drawing interpretations and variability in icon size—are generic features and commonly occur in student modeling in science. These challenges may apply to other automated system designs. The last two themes—mixed charge representations and ambiguous charge locations—are likely more task-specific challenges related to explicit task measurement purposes within this study. Specifically, these last two themes more accurately relate to modeling proficiency. Students not proficient in modeling will likely place charges ambiguously or use alternative ways of representing charge magnitude. For instance, in our modeling task for the electroscope, we particularly care whether students demonstrate their understanding by placing charges in specific locations on the rod and electroscope. We also found that the image recognition algorithms frequently misinterpret models with mixed charge representations due to a lack of varied examples in the training data, highlighting a gap in the AI's learning that can lead to significant scoring inaccuracies. Moreover, the challenge of ambiguously located charges in student models reveals the AI's limited ability to utilize contextual clues that human experts might employ to infer appropriate interpretations. Other assessment tasks with similar goals (e.g., labeling of specifically located elements or using differing charges) may also need to consider these challenges when designing their systems or rubrics. Additional studies will determine whether these ways of modeling relate to a lower understanding of the content (charges specifically), a lack of proficiency in modeling practice, or both.
We also identified two other themes from discrepant scores: challenges resulting from students' hand-drawn components and from icon sizes. These are likely relevant in applying AI evaluation to a broad range of modeling tasks in science assessment. These discrepancies point out the challenges AI faces in accurately analyzing complex and creative student responses that are standard in scientific modeling tasks. For instance, AI systems struggle to understand the nuanced and contextual elements of free-style drawings, often failing to recognize the creativity and depth such representations may convey compared to human raters. The issue with scoring hand-drawn models is twofold. First, AI often fails to recognize proficiency in detailed and accurate hand-drawn models because it cannot interpret the creativity and depth that human raters can. Second, the variability in icon size students use to represent charge strengths often leads to misinterpretations by AI systems. AI may ignore the implied intensity differences intended by the students, resulting in inaccurate assessments. Therefore, the discrepancies highlight the need to define what constitutes proficiency in student hand-drawn models for both AI and human raters. Some hand-drawn models show high proficiency and detail but are still not scored accurately by AI. This indicates that the problem is not just the ambiguity or lack of proficiency in the models but also the AI's limitations in evaluating them. However, recent AI approaches are increasingly adept at interpreting handwritten text and hand-drawn sketches and should continue to be explored in the context of scientific practices like modeling (see below; Xu et al., 2022).
Based on the outcomes of this study, in which students used a set of electronic drawing tools, models that consistently used the supplied pre-defined elements were mis-scored less often by the AI system. Thus, some constraints to completely free drawings seem beneficial for later AI scoring applications. However, using these tools could constrain students’ ability to draw and model as they desire and limit their representations and creativity. Thus, some balance between constrained electronic tools and free-hand drawings should be realized. To address the nuanced challenges identified, AI algorithms must incorporate advanced data interpretation capabilities. This could involve integrating machine learning techniques for multimodal data processing and context-sensitive analysis. As educational assessments increasingly utilize AI tools, ensuring these technologies can support and value student creativity in developing scientific models becomes critical. This involves technological advancements in AI and adjustments in educational practices to accommodate and nurture diverse expressions of student understanding.
These findings suggest an urgent need for enhancing AI algorithms to better handle the complexities inherent in student-generated scientific models. This includes expanding training datasets to encompass a wider variety of student responses and integrating more sophisticated, context-aware machine learning techniques that can adapt to the diverse ways students express their scientific understanding. Such advancements are crucial for improving the accuracy and fairness of AI assessments and ensuring that AI-supported educational tools can enhance learning outcomes by accurately evaluating and supporting all students’ work.
This study contributes to the broader field of educational technology by highlighting specific areas where AI application in science education can be refined and by providing an explicit analysis of the limitations current systems face, thereby guiding future developments in AI-assisted educational assessments. It not only demonstrates AI's capacity to replicate human-like judgment but also reveals critical gaps in its ability to process complex and creative student inputs. By addressing these gaps, future research can pave the way for more reliable, fair, and innovative uses of AI in education, ensuring that it supports a broad spectrum of student abilities and expressions.

Rationale for Using Deep Learning Approaches Instead of GenAI

Generative AI (GenAI) models, such as Gemini and GPT-4, have demonstrated significant potential in assessing multimodal inputs, including handwritten work, as highlighted in previous studies (Kortemeyer et al., 2025). Despite their strengths, GenAI models face notable challenges, particularly with data privacy, stability, and consistency. For example, their outputs can vary significantly based on the phrasing of prompts, which poses challenges for reliable performance in specific, high-stakes educational contexts. In this study, we prioritized the development of a stable and consistent algorithm capable of accurately analyzing student models across diverse datasets and instructional contexts. To achieve this goal, we selected a CNN-based deep learning approach. This method enabled task-specific tuning and demonstrated superior reliability in detecting and analyzing detailed features within students’ scientific models. While GenAI models remain promising for handling diverse and unstructured tasks, the specific requirements of our study—focused on physics-based modeling proficiency within a predefined curriculum—favored the use of a deep learning approach for its stability and adaptability. We encourage further research to explore the applications of GenAI in science education, particularly in handling multimodal inputs (Yang et al., 2024). However, in the context of this study, the deep learning approach offered a more contextually appropriate and reliable solution for achieving our research objectives.
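The CNN choice rests on convolutional feature detectors. The pure-Python sketch below (ours, not the study's network) shows the basic operation: a small kernel slid across a grid produces a feature map that responds most strongly where a local pattern, such as a vertical stroke, appears — the mechanism that lets a trained CNN pick out charge marks and device parts in student drawings.

```python
# Minimal 2D convolution (valid padding, stride 1) on nested lists.
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[0, 1, 0],      # a vertical stroke in the middle column
         [0, 1, 0],
         [0, 0, 0]]
kernel = [[1, 0],        # a detector for a 2-cell vertical stroke
          [1, 0]]

fmap = conv2d(image, kernel)
print(fmap)  # strongest response where the stroke aligns with the detector
```

Stacking many learned kernels, nonlinearities, and pooling layers turns this primitive into the task-specific, tunable feature extraction that motivated the CNN choice over prompt-driven GenAI scoring.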

Declarations

Ethical Approval

This study was approved by Michigan State University IRB. All procedures performed in studies involving human participants were in accordance with the ethical standards of the research committee.

Research Involving Human Participants and/or Animals

This research involved human participants, and no animals were used in this study. All procedures performed were in accordance with the ethical standards of the Michigan State University IRB.
All participants gave their informed consent to participate in this study.
Participants provided their consent for the publication of anonymized data.

Competing Interests

The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Title
Utilizing Deep Learning AI to Analyze Scientific Models: Overcoming Challenges
Authors
Tingting Li
Kevin Haudek
Joseph Krajcik
Publication date
01-04-2025
Publisher
Springer Netherlands
Published in
Journal of Science Education and Technology / Issue 4/2025
Print ISSN: 1059-0145
Electronic ISSN: 1573-1839
DOI
https://doi.org/10.1007/s10956-025-10217-0
go back to reference Abu Zohair, L. M. (2019). Prediction of student’s performance by modelling small dataset size. International Journal of Educational Technology in Higher Education, 16(1), 27. https://doi.org/10.1186/s41239-019-0160-3CrossRef
go back to reference Bejar, I. I., Williamson, D. M., & Mislevy, R. J. (2006). Human scoring. In D. M. Williamson, I. I. Bejar, & R. J. Mislevy (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 49–81). Lawrence Erlbaum Associates.
go back to reference Bejar, I. I., Braun, H., & Tannenbaum, R. (2007). A prospective, predictive, and progressive approach to standard setting. In R. W. Lissitz (Ed.), Assessing and modeling cognitive development in school: Intellectual growth and standard setting (pp. 1–30). JAM Press.
go back to reference Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953CrossRef
go back to reference Fukushima, K., & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6), 455–469.CrossRef
go back to reference Gilbert, J. K. (2004). Models and modelling: Routes to more authentic science education. International Journal of Science and Mathematics Education, 2(2), 115–130.CrossRef
go back to reference Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587). IEEE. https://doi.org/10.1109/CVPR.2014.81
go back to reference Glaser, B. G. (1965). The constant comparative method of qualitative analysis. Social Problems, 12(4), 436–445.CrossRef
go back to reference Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
go back to reference Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (p. 27). Retrieved March 16, 2025, from https://arxiv.org/abs/1406.2661
go back to reference Haudek, K. C., Prevost, L. B., Moscarella, R. A., Merrill, J., & Urban-Lurain, M. (2012). What are they thinking? Automated analysis of student writing about acid–base chemistry in introductory biology. CBE—Life Sciences Education, 11(3), 283–293. https://doi.org/10.1187/cbe.12-02-0023CrossRef
go back to reference Haudek, K. C., & Zhai, X. (2023). Examining the effect of assessment construct characteristics on machine learning scoring of scientific argumentation. International Journal of Artificial Intelligence in Education. https://doi.org/10.1007/s40593-023-00385-8CrossRef
go back to reference He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284. https://doi.org/10.1109/TKDE.2008.239CrossRef
go back to reference He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778). IEEE. https://doi.org/10.1109/CVPR.2016.90
go back to reference He, P., Chen, I. C., Touitou, I., Bartz, K., Schneider, B., & Krajcik, J. (2023). Predicting student science achievement using post-unit assessment performances in a coherent high school chemistry project-based learning system. Journal of Research in Science Teaching, 60(4), 724–760. https://doi.org/10.1002/tea.21815CrossRef
go back to reference He, P., Shin, N., Kaldaras, L., spsampsps Krajcik, J. (2024). Integrating artificial intelligence into learning progression-based learning systems to support student knowledge-in-use: Opportunities and challenges. In H. Jin, D. Yan, spsampsps J. Krajcik (Eds.), Handbook of Research in Science Learning Progressions. https://doi.org/10.4324/9781003170785-31
go back to reference James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.CrossRef
go back to reference Japkowicz, N., & Shah, M. (2011). Evaluating learning algorithms: A classification perspective. Cambridge University Press.CrossRef
go back to reference Kaldaras, L., Akaeze, H., & Krajcik, J. (2021). Developing and validating Next Generation Science Standards-aligned learning progression to track three-dimensional learning of electrical interactions in high school physical science. Journal of Research in Science Teaching, 58(4), 589–618. https://doi.org/10.1002/tea.21670CrossRef
go back to reference Kaldaras, L., Akaeze, H. O., & Krajcik, J. (2023). Developing and validating a next generation science standards-aligned construct map for chemical bonding from the energy and force perspective. Journal of Research in Science Teaching. https://doi.org/10.1002/tea.21790CrossRef
go back to reference Kaldaras, L., & Haudek, K. C. (2022). Validation of automated scoring for learning progression-aligned Next Generation Science Standards performance assessments. In Frontiers in Education (vol. 7, p. 968289). Frontiers Media SA. https://doi.org/10.3389/feduc.2022.968289
go back to reference Kaldaras, L., Li, T., Haudek, K. C., & Krajcik, J. (2024). Developing rubrics for AI scoring of NGSS learning progression-based scientific models. In AERA Annual Conference. https://doi.org/10.3102/2109181
go back to reference Kaldaras, L., Yoshida, N. R., & Haudek, K. C. (2022, November). Rubric development for AI-enabled scoring of three-dimensional constructed-response assessment aligned to NGSS learning progression. In Frontiers in education (Vol. 7, p. 983055). Frontiers Media SA.
go back to reference Krajcik, J., Schneider, B., Miller, E. A., Chen, I. C., Bradford, L., Baker, Q., Bartz, K., Miller, C., Li, T., Codere, S., & Peek-Brown, D. (2023). Assessing the effect of project-based learning on science learning in elementary schools. American Educational Research Journal, 60(1), 70–102. https://doi.org/10.3102/00028312221100896
go back to reference Li, T., Reigh, E., He, P., & Adah Miller, E. (2023). Can we and should we use artificial intelligence for formative assessment in science? Journal of Research in Science Teaching, 60(6), 1385–1389. https://doi.org/10.1002/tea.21780CrossRef
Khan, I., Ahmad, A. R., Jabeur, N., & Mahdi, M. N. (2021). An artificial intelligence approach to monitor student performance and devise preventive measures. Smart Learning Environments, 8, 1–18. https://doi.org/10.1186/s40561-021-00161-y
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint, arXiv:1412.6980. Retrieved March 16, 2025, from http://arxiv.org/abs/1412.6980
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI) (Vol. 14, No. 2, pp. 1137–1145). Morgan Kaufmann.
Kortemeyer, G., Babayeva, M., Polverini, G., Gregorcic, B., & Widenhorn, R. (2025). Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories. arXiv preprint, arXiv:2501.06143. Retrieved March 16, 2025, from http://arxiv.org/abs/2501.06143
Kress, G. (2009). Multimodality: A social semiotic approach to contemporary communication. Routledge.
Krippendorff, K. (2011). Computing Krippendorff’s alpha reliability. University of Pennsylvania ScholarlyCommons. Retrieved March 16, 2025, from https://repository.upenn.edu/asc_papers/43
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
Lee, H., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590–622. https://doi.org/10.1002/sce.21504
Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(1), 559–563. Retrieved March 16, 2025, from http://jmlr.org/papers/v18/16-365.html
Li, T., Miller, E., Chen, I. C., Bartz, K., Codere, S., & Krajcik, J. (2021). The relationship between teachers’ support of literacy development and elementary students’ modelling proficiency in project-based learning. Journal of Science Education and Technology, 30(5), 694–708. https://doi.org/10.1007/s10956-021-09900-9
Li, T., Adah Miller, E., & He, P. (2024). Culturally and linguistically “blind” or biased? Challenges for AI assessment of models with multiple language students. In Proceedings of the Annual Meeting of the International Society of the Learning Sciences (ISLS).
Li, T., Chen, I. C., Adah Miller, E., Miller, C. S., Schneider, B., & Krajcik, J. (2024). The relationships between elementary students’ knowledge-in-use performance and their science achievement. Journal of Research in Science Teaching, 61(2), 358–418. https://doi.org/10.1002/tea.21820
Liu, O. L., Rios, J. A., Heilman, M., Gerard, L., & Linn, M. C. (2016). Validation of automated scoring of science assessments. Journal of Research in Science Teaching, 53(2), 215–233. https://doi.org/10.1002/tea.21299
Lu, W., & Tran, E. (2017). Free-hand sketch recognition classification. CS 231N Project Report, Stanford University. Retrieved March 16, 2025, from http://cs231n.stanford.edu/
Mayo, M. (2024, May 24). Tips for handling imbalanced data in machine learning. Machine Learning Mastery. Retrieved March 16, 2025, from https://machinelearningmastery.com/tips-for-handling-imbalanced-data-in-machine-learning/
Microsoft. (2023). Prevent overfitting and imbalanced data with Automated ML. Microsoft Learn. Retrieved March 16, 2025, from https://learn.microsoft.com/en-us/azure/machine-learning/concept-automated-ml
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20. https://doi.org/10.1111/j.1745-3992.2006.00075.x
Nagidi, J. (2024). Best ways to handle imbalanced data in machine learning. Dataaspirant. Retrieved March 16, 2025, from https://dataaspirant.com/best-ways-handle-imbalanced-data-machine-learning
Namdar, B., & Shen, J. (2015). Modeling-oriented assessment in K-12 science education: A synthesis of research from 1980 to 2013 and new directions. International Journal of Science Education, 37(7), 993–1023. https://doi.org/10.1080/09500693.2015.1012185
National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6–12: Investigation and design at the center. The National Academies Press. https://doi.org/10.17226/25216
National Research Council. (2000). How people learn: Brain, mind, experience, and school (Expanded ed.). The National Academies Press. https://doi.org/10.17226/9853
National Research Council. (2001). Knowing what students know: The science and design of educational assessment. The National Academies Press. https://doi.org/10.17226/10019
National Research Council. (2006). Systems for state science assessment. The National Academies Press.
National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. The National Academies Press.
National Research Council. (2014). Developing assessments for the next generation science standards. The National Academies Press.
Nehm, R. H., Ha, M., & Mayfield, E. (2012). Transforming biology assessment with machine learning: Automated scoring of written evolutionary explanations. Journal of Science Education and Technology, 21, 183–196. https://doi.org/10.1007/s10956-011-9300-9
NGSS Lead States. (2013). Next generation science standards: For states, by states. The National Academies Press.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 8024–8035.
Powers, D. M. W. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. International Journal of Machine Learning Technology, 2(1), 37–63.
Schwarz, C. V., Reiser, B. J., Davis, E. A., Kenyon, L., Achér, A., Fortus, D., ... & Krajcik, J. (2009). Developing a learning progression for scientific modeling: Making scientific modeling accessible and meaningful for learners. Journal of Research in Science Teaching, 46(6), 632–654. https://doi.org/10.1002/tea.20311
Schwarz, C. V., Passmore, C., & Reiser, B. J. (2017). Moving beyond “knowing about” science to making sense of the world. In C. V. Schwarz, C. Passmore, & B. J. Reiser (Eds.), Helping students make sense of the world using next generation science and engineering practices (pp. 3–21). NSTA Press.
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–447. https://doi.org/10.1016/j.ipm.2009.03.002
Sowjanya, A. M., & Mrudula, O. (2023). Effective treatment of imbalanced datasets in health care using modified SMOTE coupled with stacked deep learning algorithms. Applied Nanoscience, 13(3), 1829–1840. https://doi.org/10.1007/s13204-022-02602-1
Van Asch, V., & Daelemans, W. (2016). Predicting the effectiveness of self-training: Application to sentiment classification. arXiv preprint, arXiv:1601.03288. Retrieved March 16, 2025, from http://arxiv.org/abs/1601.03288
Wilson, C. D., Haudek, K. C., Osborne, J. F., Buck Bracey, Z. E., Cheuk, T., Donovan, B. M., ... & Zhai, X. (2024). Using automated analysis to assess middle school students' competence with scientific argumentation. Journal of Research in Science Teaching, 61(1), 38–69. https://doi.org/10.1002/tea.21904
Windschitl, M., Thompson, J., & Braaten, M. (2008). Beyond the scientific method: Model-based inquiry as a new paradigm of preference for school science investigations. Science Education, 92(5), 941–967. https://doi.org/10.1002/sce.20259
Xu, P., Hospedales, T. M., Yin, Q., Song, Y.-Z., Xiang, T., & Wang, L. (2023). Deep learning for free-hand sketch: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 285–312. https://doi.org/10.1109/TPAMI.2022.3148853
Yang, K., Chu, Y., Darwin, T., Han, A., Li, H., Wen, H., Copur-Gencturk, Y., Tang, J., & Liu, H. (2024). Content knowledge identification with multi-agent large language models (LLMs). In A. M. Olney, I.-A. Chounta, Z. Liu, O. C. Santos, & I. I. Bittencourt (Eds.), Artificial intelligence in education (pp. 284–292). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-33464-9_23
Zhai, X., He, P., & Krajcik, J. (2022). Applying machine learning to automatically assess scientific models. Journal of Research in Science Teaching, 59(10), 1765–1794. https://doi.org/10.1002/tea.21723