Open Access 16-09-2024 | HAUPTBEITRAG

Strengths and weaknesses of automated scoring of free-text student answers

Authors: Marie Bexte, Andrea Horbach, Torsten Zesch

Published in: Informatik Spektrum

Abstract

Free-text tasks, where students need to write a short answer to a specific question, serve as a well-established method for assessing learner knowledge. To address the high cost of manually scoring these tasks, automated scoring models can be used. Such models come in various types, each with its own strengths and weaknesses. Comparing these models helps in selecting the most suitable one for a given problem. Depending on the assessment context, this decision can be driven by ethical or legal considerations. When implemented successfully, a scoring model has the potential to substantially reduce costs and enhance the reliability of the scoring process. This article compares the different categories of scoring models across a set of crucial criteria that have immediate relevance to model employment in practice.
We focus on scoring free-text answers, which may originate directly from typed data or can be transcribed from handwritten or spoken responses. Depending on the aim of teachers, the scoring may focus on language quality or factual correctness. In the field of educational natural language processing, evaluation mainly targeting factual correctness is referred to as short-answer scoring or content scoring [1, 2], while evaluation that also targets language quality is commonly referred to as essay scoring [3]. In this article, we focus on free-text content scoring. Models for this task can roughly be grouped into four categories. We compare the different model types along a set of key dimensions. This concise comparison of their strengths and weaknesses allows for an informed assessment of the suitability of each method for a given application scenario.

Potential benefits of automated content scoring

For centuries, the traditional way of assessing answers has been for teachers to manually score them, which is a rather tedious and stressful task [4]. It consumes time that teachers could otherwise allocate to different responsibilities. Employing an automated model to score learner answers could, therefore, free up these resources. It may also facilitate giving timely feedback to a large group of learners. From the learner’s perspective, this opens the door to answering the same question repeatedly until the correct solution is found, each time receiving immediate evaluation [5].
Automated scoring usually builds on a set of manually evaluated answers and learns the association between answers and their scores. Ideally, this creates a model that is able to generalize to new, unseen answers. Automated scoring may also mitigate reliability issues that arise with manual scoring. For example, in manual scoring, the order in which answers are evaluated [6] or the relationship between teacher and learner may influence evaluation outcomes. The effects of fatigue [7] or mood [8] can also negatively impact evaluation reliability. Additionally, different types of raters may have different rating tendencies [9], which can be another source of variance in human judgements. Automated scoring helps overcome some of these shortcomings: a computer does not get tired and is unaffected by mood or the order in which answers are scored [5]. Computer-assisted scoring may thus not only cut evaluation time significantly but also promises to increase scoring reliability.

Challenges of automated content scoring

Despite the potential benefits of automated scoring, one must not lose sight of the remaining challenges. One challenge is legal requirements, which differ depending on where the scoring is to take place. In Europe, the proposed Artificial Intelligence Act of the European Union [10] will regulate automated scoring with regard to its anticipated risk. In some cases, the current legal situation may prohibit automated scoring entirely, depending on ‘the margin of discretion in evaluating an answer’, which is usually rather high for free-text scoring. A more general legal aspect is to ensure that data is processed in line with the respective data protection regulations. Regarding the quality assurance of testing instruments, the American Educational Research Association (AERA), the American Psychological Association (APA) and the National Council on Measurement in Education (NCME) give a number of criteria in their Standards for Educational and Psychological Testing [11]. These can become legally binding when referenced in agreements.
When it comes to ethics, explainability of machine-given scores is a key issue. Due to their complexity, models often provide insufficient explanations for how they arrive at a given score [12]. To address this, assisted scoring systems have been proposed, where automatically generated score suggestions are presented to the teacher, who makes the final scoring decision. However, such systems pose the risk of automation bias, i.e. teachers overly relying on the machine-given suggestions [13, 14]. In general, both humans and machines are prone to exhibiting bias, but with machines the bias is systematic. This distinction matters when considering whether learners have any influence over who evaluates them. In traditional manual evaluation, it is generally possible to insist on being evaluated by a different individual; although a certain level of effort may be required, one could in extreme cases move to a different institution. If we assume the same scoring model to be employed widely across institutions, this may bring the benefit of consistent evaluation, but if this model were to systematically discriminate against a certain group of learners [15], affected individuals would be unable to escape from it.
Even if it is legally and ethically permissible, a remaining challenge is to ensure that the quality of automated scoring comes acceptably close to the performance of manual scoring [16]. As we argue in the next section, this is mainly influenced by answer variance.

Drivers of scoring difficulty

The difficulty of scoring a set of answers is mainly driven by the linguistic variance of the answers [17, 18]: to be able to derive how answers relate to their scores, models need to have observed a representative set of answers. Regarding answer content, Zesch, Horbach and Zehner [18] distinguish conceptual variance, realization variance and nonconformity variance. Conceptual variance is inherent to the task, i.e. the set of concepts that constitute correct answers to the question, and common misconceptions of these. The example question “Name one of the three branches of government” is correctly answered by either judicial branch, executive branch or legislative branch [18]. Realization variance is rooted in the language used to express an answer. It can stem from the use of implicit language [19], paraphrases or synonyms, but also from orthographical or grammatical errors. Taking lexical diversity as the combined manifestation of the different types of variance, one can demonstrate its negative relation to scoring performance [18]. Finally, nonconformity variance arises whenever students deviate from the prescribed task expectations by giving a nonsensical or off-topic response, such as I wasn’t there when we did this, or try to game the system, for example by just answering branch.
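As a rough illustration, lexical diversity can be approximated with a simple type-token ratio over an answer set. The following is a minimal sketch with made-up example answers; the type-token ratio is only one possible proxy and not necessarily the measure used in [18]:

```python
# Type-token ratio as a simple proxy for the lexical diversity of an answer set.
def type_token_ratio(answers):
    tokens = [token.lower() for answer in answers for token in answer.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Made-up answer sets: low vs. high realization variance for the same concept.
low_variance = ["legislative branch", "the legislative branch", "legislative branch"]
high_variance = ["the branch that makes laws", "legislature", "congress, i.e. the law-making body"]

print(type_token_ratio(low_variance))   # comparatively low diversity
print(type_token_ratio(high_variance))  # comparatively high diversity
```
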
Another influence on scoring difficulty stems from the number of possible scores and their distribution. A higher number of possible outcomes (e.g. binary vs. 6 grades) usually leads to a more challenging scoring task [20]. At the same time, unequal label distributions lead to worse models [4]; e.g. even if there are only two labels (good/bad) it is impossible to learn a good model if we only see examples of one class in the training data. In general, to learn a good representation of a certain label, a classification model must have seen substantial evidence for it [18].

Types of content scoring models

Generally, approaches to automated content scoring can be distinguished into instance-based and similarity-based methods [17]. In the instance-based paradigm, learner answers are processed into a representation from which the model derives the scores. In similarity-based scoring, this score is derived from the similarity between a learner answer and one or more reference answers. These reference answers may be limited to ideal solutions [21] or may also try to cover answer variance [22–24]. While instance-based methods traditionally outperformed similarity-based ones, recent models that sufficiently adapt the similarity metric close this gap [24, 25]. The distinction into instance-based and similarity-based approaches is orthogonal to the distinction between the model types that we compare: rule-based, shallow, deep and generative. A key difference between the model types is which kind of human intervention is needed at what point in the scoring process (see Fig. 1).
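To make the similarity-based paradigm concrete, the following minimal sketch labels an answer by comparing it to a set of reference answers with a sentence-embedding model; the model name, reference answers and similarity threshold are illustrative assumptions, not the setup used in [24, 25]:

```python
# Hypothetical similarity-based scorer: compare a learner answer to reference answers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

reference_answers = ["legislative branch", "executive branch", "judicial branch"]
reference_embeddings = model.encode(reference_answers, convert_to_tensor=True)

def similarity_score(learner_answer: str) -> str:
    answer_embedding = model.encode(learner_answer, convert_to_tensor=True)
    best_similarity = float(util.cos_sim(answer_embedding, reference_embeddings).max())
    return "correct" if best_similarity > 0.7 else "incorrect"  # threshold is an arbitrary assumption

print(similarity_score("the branch that passes laws"))
```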

Rule-based models

In rule-based methods, the scoring model is manually defined by a domain expert and consists of a set of rules. Depending on which rules a specific answer satisfies, a score is assigned. This could be as simple as saying that if legislative occurs in an answer to our example question, the answer is labelled as correct. An obvious limitation of this rule is that it would also classify the answer Not legislative as correct. Nor can it account for spelling errors. While these issues may be circumvented with a more sophisticated set of rules, it is challenging to come up with a truly robust and efficient rule base that anticipates the many different answers learners may provide.
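A toy version of such a keyword rule might look as follows (a sketch of the general idea, not a system from the literature); it reproduces both weaknesses mentioned above:

```python
# Toy rule-based scorer for the example question.
import re

KEYWORD_RULES = [r"\blegislative\b", r"\bexecutive\b", r"\bjudicial\b"]

def rule_based_score(answer: str) -> str:
    # An answer is labelled correct if any keyword rule matches.
    if any(re.search(rule, answer.lower()) for rule in KEYWORD_RULES):
        return "correct"
    return "incorrect"

print(rule_based_score("the legislative branch"))  # correct
print(rule_based_score("Not legislative"))         # also labelled correct: the rule ignores negation
print(rule_based_score("legislatve branch"))       # labelled incorrect: the misspelling breaks the match
```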

Shallow models

Shallow models delegate some of the manual effort to machine learning. With these, one first has to manually define a set of features that should be taken into account when evaluating answers. Such features could be the answer length or reflect which words or phrases occur in an answer. Each answer is then processed into its feature-based representation, and it is this vector representation, rather than the raw answer, that the scoring model builds on. The selection of features is therefore crucial to the success of the model. A model is fit to the data, aiming to encapsulate the relation between answer features and scores. Afterwards, new answers can be scored by first creating their feature-based representations and then feeding these to the model.
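The following sketch illustrates the idea with two simple feature types (word n-grams and answer length) and a logistic regression classifier; the features, classifier and toy training answers are illustrative choices only:

```python
# Minimal shallow scoring pipeline: hand-chosen features plus a classical classifier.
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

answers = ["legislative branch", "judicial branch", "the moon", "no idea"]  # toy training data
labels = ["correct", "correct", "incorrect", "incorrect"]

# Answer length (in tokens) as an additional hand-crafted feature.
length_feature = FunctionTransformer(lambda texts: [[len(t.split())] for t in texts])

model = make_pipeline(
    make_union(CountVectorizer(ngram_range=(1, 2)), length_feature),
    LogisticRegression(),
)
model.fit(answers, labels)
print(model.predict(["executive branch"]))  # prediction quality depends on the (tiny) training set
```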

Deep classification models

A further machine learning-based approach is deep learning, where models can directly build on raw answer texts. While this removes the need to manually design features, it also gives up control over which aspects of an answer the model attends to. Deep models are neural networks that consist of multiple layers and are trained iteratively. Many of the more recent approaches build on a pre-trained model, such as variants of the popular BERT model [26]. These generic models already come with an understanding of language in general and are then fine-tuned to the specific task.
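A minimal fine-tuning sketch using the Hugging Face transformers library could look as follows; the model name, toy data and hyperparameters are illustrative assumptions rather than a recommended setup:

```python
# Sketch: fine-tune a pre-trained BERT-style model as a binary answer classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

answers = ["legislative branch", "judicial branch", "the moon", "no idea"]  # toy training data
labels = torch.tensor([1, 1, 0, 0])  # 1 = correct, 0 = incorrect

batch = tokenizer(answers, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy data; real training uses far more answers
    optimizer.zero_grad()
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    logits = model(**tokenizer(["executive branch"], return_tensors="pt")).logits
print(logits.argmax(dim=-1))  # predicted label index
```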

Deep prompting models

A version of deep learning with great recent popularity is using large generative language models such as GPT [27, 28]. Generative models do not just encode but also produce language and can be prompted to evaluate a learner answer. Still, in order for outputs to conform to a desired format, e.g. scores in a certain range, this information has to be presented to the model as well. Additionally, it may be beneficial to enrich the input further by including some manually labelled answers as examples, much like using training data to develop other machine learning models. Thus, it is this prompting that can be seen as a way of fine-tuning [29]. While it is possible to adapt the model itself, since these models are just a special case of deep models, we discuss them in the former sense of using prompting to adapt to the task.
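The sketch below shows what such a scoring prompt with a rubric and a few labelled example answers might look like; `call_llm` stands in for whatever generative model API is available and is not a specific library call:

```python
# Sketch of few-shot prompting for scoring; the rubric and examples are made up.
EXAMPLES = [
    ("legislative branch", "correct"),
    ("the moon", "incorrect"),
]

def build_prompt(answer: str) -> str:
    lines = [
        "Question: Name one of the three branches of government.",
        "Rubric: An answer is correct if it names the legislative, executive or judicial branch.",
        "Respond with exactly one word: correct or incorrect.",
    ]
    for example_answer, label in EXAMPLES:  # few-shot examples play the role of training data
        lines.append(f"Answer: {example_answer}\nScore: {label}")
    lines.append(f"Answer: {answer}\nScore:")
    return "\n".join(lines)

def prompt_based_score(answer: str, call_llm) -> str:
    # call_llm: placeholder for a function that sends the prompt to a generative model.
    response = call_llm(build_prompt(answer)).strip().lower()
    return "incorrect" if "incorrect" in response else "correct"
```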

Properties of scoring methods

Which model type is ideal depends on the circumstances of an application scenario, as each model type has its own strengths and weaknesses. Table 1 organizes the most important, commonly discussed model properties into groups: those pertaining to model performance, architecture, efficiency, and transferability to new tasks and languages. For comparison, we also include manual scoring. For each property, we derive a rating of each of the model types from the respective research, which we discuss in the subsequent paragraphs.
Table 1
Overview of model properties. Note that this is intended to serve as a rough guide and is thus simplified into a three-level categorization. Special circumstances, tasks or algorithms may deviate from this rating, and a comparison across categories is not intended, i.e. one star when it comes to scoring cost is not equivalent to one star with respect to training cost

                       | Manual | Rule-based | Shallow | Deep classification | Deep prompting
Model performance
  Quality              | ★★★    | ★★         | ★★      | ★★★                 | ★★
  Reliability          | ★★     | ★★★        | ★★★     | ★★★                 | ★★
  Robustness           | ★★★    | ★          | ★       | ★★                  | ★★
Model architecture
  Degree of automation | ★      | ★★         | ★★      | ★★★                 | ★★
  Feedback types       | ★★★    | ★★         | ★★      | ★★                  | ★★★
  Interpretability     | ★★★    | ★★★        | ★★      | ★                   | ★
Efficiency
  Data use             | ★★★    | ★★★        | ★       | ★                   | ★★
  Training cost        | ★★★    | ★          | ★★      | ★                   | ★★★
  Scoring cost         | ★      | ★★★        | ★★★     | ★★                  | ★★
Transferability
  Tasks                | ★★★    | ★          | ★       | ★★                  | ★★★
  Languages            | ★★     | ★          | ★       | ★★★                 | ★★★

Model performance

In deciding which method to employ, the possible level of performance is a central criterion. Performance is, however, not just a matter of accuracy, as scoring should also be reliable and robust.
Quality
In judging the performance of an automated scoring method, manual scores serve as the gold standard; it is the human scores that scoring models are designed to approximate. However, even human evaluators make mistakes and are not perfectly reliable. Two different evaluators may thus not always give the same score, and even the same evaluator may give different scores on different days. Therefore, it is common practice to measure the level of agreement between human evaluators [30]. Human-human agreement is taken as an upper bound on machine-human agreement. How close an automated system has to come to this upper bound depends on the stakes of the assessment [18]. In the case of a summative large-scale assessment, a model that does not quite reach human-human level agreement may suffice to gauge overall trends in a large cohort of learners. If the aim is to provide feedback in a formative exercise, using this same model can be inappropriate, as erroneous scores could misdirect student learning.
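One commonly used agreement measure in this context is (quadratically weighted) kappa; the sketch below compares human-human and machine-human agreement on made-up scores:

```python
# Comparing human-human and machine-human agreement with weighted kappa (made-up scores).
from sklearn.metrics import cohen_kappa_score

rater_1 = [2, 1, 0, 2, 1, 2]  # scores from the first human rater
rater_2 = [2, 1, 1, 2, 0, 2]  # scores from the second human rater
machine = [2, 1, 0, 1, 1, 2]  # scores predicted by a model

human_human = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
machine_human = cohen_kappa_score(rater_1, machine, weights="quadratic")
print(human_human, machine_human)  # human-human agreement serves as the upper bound
```
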
In the evolution of content scoring, rule-based approaches were first replaced with shallow and then deep learning [3]. The overall trend is for deep learning models to come closest to human performance. Because generative models are so recent, findings regarding their performance are still limited. There are indications that a fine-tuned ChatGPT model can outperform the standard deep learning method [31] and that a small amount of labelled data can suffice for performance to come close to that of humans on some aspects [32]. Other studies, however, find that their performance is at most on par with other, more efficient scoring approaches [33]. Overall, more research is needed for a more comprehensive estimate of their scoring capabilities.
Reliability
An aspect where machines are superior to humans is reliability. As discussed, human scoring can be affected by fatigue or mood [7, 8]. Automated scoring models give the same evaluation when repeatedly presented with the same answer. Generative models are the only model type we discuss for which this has to be qualified somewhat: they are designed not to produce the exact same response every time the same input is presented to them, and, depending on the setup, the sequence of answers may influence the evaluation. Thus, if multiple students give the exact same answer, there is no guarantee that they will be evaluated identically.
Robustness
While humans tend to be inferior when it comes to reliability, they are superior in terms of robustness, i.e. the extent to which scoring quality depends on factors other than the correctness of an answer. Grammatical errors and misspelled words are a form of noise that is unlikely to influence the factual correctness of an answer, so scores should be unaffected by their presence. Since models rely on surface strings, misspellings do, however, have the potential of obscuring an answer to the point of influencing its evaluation [35]. Although normalization may mitigate this effect, it equally poses the risk of altering a response to the point where the semantics of relevant aspects are affected [18]. However, research hints at shallow learning being relatively robust against spelling errors [35]. With deep learning, spelling errors prevent the respective word from being mapped to its learned embedding. The use of character embeddings [34] or WordPiece [36] can mitigate this somewhat, as the latter breaks words down into sub-words, which opens the possibility for at least parts of a word to be mapped to semantically meaningful embeddings.
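The effect of sub-word tokenization can be inspected directly; in the sketch below, a WordPiece tokenizer (here bert-base-uncased, an illustrative choice) splits a misspelled word into pieces:

```python
# Inspecting how a WordPiece tokenizer handles a correctly and an incorrectly spelled word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("legislative"))  # a frequent word is likely a single vocabulary entry
print(tokenizer.tokenize("legislatve"))   # the misspelling is typically split into several sub-word pieces
```
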
To a human, it usually does not matter if learners use synonyms or unusual sentence structures, and nonsense responses are fairly easy to identify. Scoring models, on the other hand, are trained on a certain range of answers, which sets the scope of language they are familiar with. If a learner innocently writes an answer in their own style, which might be unfamiliar to a model, this can produce unexpected evaluations [37]. Awareness of this risk could set the undesirable incentive for learners to play it safe and avoid, for example, unconventional or creative formulations [5]. While rule-based and shallow models are limited to what is covered by their rules or training data, deep learning models have the advantage of building on pre-trained models. This enables them to more easily pick up on generic semantic phenomena, such as the use of synonyms.
A more deliberate form of noise in the data can arise if learners are aware of the fact that they are being scored by a computer: they may try to fool the system into assigning a high score to a potentially nonsensical answer [38–40]. In manual evaluation, an answer that contains the ‘right’ keywords but combines them into an incorrect answer is unlikely to get the full score. In a rule-based system, however, there could be a rule that defines answers containing a certain keyword as correct—irrespective of whether it occurs in the right context. Similarly, shallow methods may have difficulty recognizing this: while context can be encapsulated by answer features, it tends to be limited to a rather small context window. State-of-the-art deep learning models do not have this context limitation, which increases their ability to robustly judge even answers that try to fool the system.

Model architecture

Another set of properties where models differ originates from their architectures. We discuss how this affects the degree of automation, the kinds of feedback they can provide and how easy it is to interpret model predictions.
Degree of automation
Wanting to reduce manual effort is a central motivation to introduce automated scoring, and the level of manual effort is a core difference between the different model types. In rule-based scoring, rules have to be designed manually. Similarly, in shallow learning, a feature set has to be designed. Deep models learn directly from raw answers and thus delegate even more of the otherwise manual effort to the model. With generative deep models, there is again slightly more human intervention needed to decide on the optimal prompt to present to the model: How should the scoring rubric be included, and to what extent are exemplary answers needed to guide the model? These decisions can have substantial influence on the model output and thus require careful consideration [41]. Apart from the different needs for manual intervention, the various methods also require different types of human experts. Manual scoring is done by teachers. Teachers can also assist in designing a rule-based system and prompt a generative model, but the feature design in shallow learning requires an expert in automated scoring. Another difference is at which point of time manual intervention is needed. Manual scoring takes place after learner answers are collected, and a shallow or deep classification model is trained on a sample of already collected and manually scored answers. Since rule-based models do not depend on a training dataset, they can be prepared beforehand. Deep prompting models are ready without additional preparation.
Feedback types
Feedback can range from a single numeric score to an elaborate textual evaluation. Which kind of feedback is desired is to a certain extent linked to the aim of the assessment in general. In a summative evaluation, one might prefer to have a single score, perhaps derived as an aggregate of a set of sub-scores. In a formative setting, on the other hand, the aim could be to present more elaborate textual feedback to the learner to foster learning. When it comes to supporting outlier students, this kind of personalized feedback may be harder to accommodate using automated scoring [5]. All methods that we discuss here are able to predict numeric scores, and thus individual models can be learned for a set of multiple sub-scores. Giving a natural language justification for why an answer warrants a certain number of points is, however, not straightforwardly supported by rule-based, shallow or deep learning models. Using a generative model opens this possibility, but more research is needed to evaluate the quality of the feedback messages these models generate.
Interpretability
Another aspect where models differ is how easily one can understand how they arrive at a prediction. Rule-based systems allow reconstruction of which rules led to a certain evaluation and thus come with this level of explainability. In shallow machine learning, this is more obscure: the features that were abstracted from an answer and form the basis for the model are known, but the complexity of the resulting model makes it less straightforward to understand how these features lead to a predicted score. In deep learning, it is the raw text that forms the input for the model. Thus, the model itself learns how to process the learner text into a representation from which it is able to infer its evaluation. While this removes the task of manually deciding on a set of features, it also makes it harder to assess which aspects of an answer play which role in the model’s evaluation. There are efforts to use model-internal attention values as indicators of which words in a text were essential for a certain classification [42]. This is, however, still far from fully understanding model predictions. The same applies to generative models: while one can prompt them to explain why they assign a certain score, this would rather be considered feedback for the learner. How this feedback came to be is again a matter of limited explainability, as the answer was produced by a rather opaque deep model.
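As a rough illustration of such attention-based inspection (only a limited and contested proxy for importance, and not the method of [42]), one can look at the attention a classifier’s [CLS] token pays to each input token:

```python
# Sketch: average last-layer attention from [CLS] to each token of an answer.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("the legislative branch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

last_layer = outputs.attentions[-1][0]           # shape: (heads, seq_len, seq_len)
cls_attention = last_layer[:, 0, :].mean(dim=0)  # attention from [CLS], averaged over heads
for token, weight in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), cls_attention):
    print(f"{token}\t{float(weight):.3f}")
```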

Efficiency

A core motivation for employing automated scoring is the efficient use of resources. This covers the amount of data and time needed to arrive at a model in the first place, the cost of running it and, crucially, how much time is needed to score answers once the model is employed.
Data use
While it is arguably desirable to have the best possible level of performance, one also has to take into account the effort that is required to reach it. Much of this centres around the volume of exemplary answers that is needed to derive a model. To manually evaluate answers, teachers are able to form a mental model of how many points to award which answer from merely reading a scoring rubric [43]. In rule-based scoring, one can design a set of rules without having any training data. Using prompting to adapt a pre-trained generative model requires only a couple of examples. With shallow and deep classification models, a more extensive set of examples is needed to obtain a useful model. How much data is needed depends heavily on the task: even the same method reaches drastically different levels of performance with the same number of training instances across different tasks from the same dataset [18]. It is also linked to the label distribution, the level of variance in the data and the match between training and test data [18]. Apart from these overarching determinants, deep learning typically requires more data and tends to give less stable results when too little data is used, whereas shallow methods tend to abstract better from the same small sample of training data [25].
Training cost
We need to distinguish between training a base model (a large language model, or educating a teacher) and training a task-specific model. The costs of training base models are one-off costs that are irrelevant for a given task (just as we do not train teachers for a specific task; both can be re-used). Manual scoring and prompting an LLM thus have essentially no task-specific training costs. Things are different with rule-based scoring, where a teacher must formalize their scoring rubric into rules to obtain a model. With shallow models, one first has to decide on a feature set, but the model can then be fit to the feature-based representations of answers quite cheaply, as the hardware requirements are modest in comparison to deep learning, which may require an expensive server (or the equivalent in cloud computing resources).
Scoring cost
While training costs are only incurred once per task, we might need to score thousands of answers per task. Teacher salaries can make scoring a large number of answers very costly. The potential savings in scoring costs are a major driver of automated scoring technology. Rule-based and shallow models are almost free, as they are very lightweight and can run on any computer. Deep learning classifiers are more costly, as they require a dedicated server. State-of-the-art large language models are currently so large that they can only be provided by a central authority (companies or governments, maybe universities). These resource demands come at both high monetary and environmental costs [44, 45].

Transferability

Since a key motivation in using automated scoring is efficiency, it is desirable to be able to re-use existing models in new application scenarios, i.e. new tasks or different languages.
Tasks
Whenever answers to a new task have to be evaluated, a suitable model has to be created first. To develop this new model, a certain amount of manually labelled answers to the new task is needed. Ideally, one could recycle an existing model and adapt it to the new task, which might require less manually labelled data than creating a new model from scratch. As deep prompting models only require a small set of exemplary answers for their application to a new task, one may see these as inherently transferable. With rule-based models, rules were designed specifically for the task at hand—these rules are unlikely to apply to a new task, making transfer an unpromising endeavour. Similarly, shallow methods fit a model to the way answer features are linked to scores, and it is unlikely for the same features to be linked to the same scores in the same way in two different tasks. By design, shallow models also do not permit later adaptation of an already existing model. Instead, there are approaches where a new model is derived from the combined data of two different tasks [46]. This opens the possibility of combining a larger volume of existing answers to one task with a smaller number of answers to a new task. Deep learning models do allow for iterative adaptation of an existing model. The success of this is, however, mixed, which is likely influenced by the relatedness of the different tasks [25, 47, 48].
Languages
If students were permitted to answer in their mother tongue, this could even out disadvantages that arise from students failing to adequately express themselves in a second language. Ideally, exactly the same model would be used to score answers in multiple languages, and training this model with answers in one language could automatically enable it to evaluate answers in other languages. In human evaluation, being able to manually evaluate answers hinges on the availability of a teacher proficient in the target language. One solution would be machine translation, which can homogenize a set of answers in different languages to one target language [49, 50]. Directly working on another language is difficult for rule-based and shallow methods, as rules and features are often specific to a language [51]. In contrast, there are pre-trained multilingual models that cover a wide range of different languages [26, 52]. These are based on embeddings aligned across different languages, opening the possibility of training in one language and evaluating in another. Models trained with a large number of answers in one language can then be used as a basis to evaluate answers in different languages [49].
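A minimal cross-lingual sketch along these lines compares an English reference answer with a German learner answer using a multilingual sentence-embedding model; the model name and the answers are illustrative assumptions, not the setup of [49, 50]:

```python
# Cross-lingual similarity between an English reference and a German learner answer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative multilingual model

reference = "legislative branch"
learner_answer = "die gesetzgebende Gewalt"  # German paraphrase of the reference

similarity = float(util.cos_sim(model.encode(reference, convert_to_tensor=True),
                                model.encode(learner_answer, convert_to_tensor=True)))
print(similarity)  # a high similarity would support labelling the answer as correct
```
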
Overall, language influences variance [17, 18] and thus scoring difficulty. There can be cases where answers to the same task are more difficult to score in one language than they are in another. Models may, for example, be more effective with languages that have a rather fixed word order.

Conclusion

This article outlined the strengths and weaknesses of models for the automated scoring of short free-text answers. We compared the four main categories of models (rule-based, shallow, deep and generative) across a set of key properties with respect to which they differ. These properties play a crucial role in deciding which model type is the most suitable for a given application scenario. This work is thus intended to foster informed decisions as to which model type to employ for a certain use case. Overall, automated scoring can reduce the workload of teachers, freeing up time they can spend on other tasks. The extent to which manual labour is delegated to the scoring model is a key difference between approaches. In general, automation also comes with the benefit of increased scoring reliability: unlike humans, who can be affected by fatigue or mood, a model will reliably give the same score to the same answer. Still, the application of automated scoring should be well thought out. Deciding on the best-suited method requires weighing up the aims and circumstances of a task. A central difficulty of many models is to explain how they came up with a score. While deep learning models tend to give the best performance and require the least manual involvement, their black-box nature makes it hard to understand how a certain score came to be. While it is important not to lose sight of legal restrictions and ethical considerations, research has demonstrated that, if sensibly designed, models are able to efficiently and reliably score free-text answers. Such models might play a decisive role in handling the increasing demand for individual feedback. We thus cannot afford to abstain from a technology that has the potential to allow fast-paced scoring of large volumes of learner answers.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Literature
1. Ziai R, Ott N, Meurers D (2012) Short Answer Assessment: Establishing Links Between Research Strands. In: Tetreault J, Burstein J, Leacock C (Hrsg) Proceedings of the Seventh Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, Montréal, Canada, S 190–200
2. Burrows S, Gurevych I, Stein B (2015) The Eras and Trends of Automatic Short Answer Grading. Int J Artif Intell Educ 25(1):60–117
3. Bai X, Stede M (2022) A Survey of Current Machine Learning Approaches to Student Free-Text Evaluation for Intelligent Tutoring. Int J Artif Intell Educ 28
4. Thomson S, Hillman K (2020) The Teaching and Learning International Survey 2018. Australian Report Volume 2: Teachers and School Leaders as Valued Professionals. OECD Teach Learn Int Surv (TALIS)
5. Hahn MG, Navarro SMB, De La Fuente Valentín L, Burgos D (2021) A Systematic Review of the Effects of Automatic Scoring and Automatic Feedback in Educational Settings. IEEE Access 9:108190–108198
6. Pei J, Li J 30 Million Canvas Grading Records Reveal Widespread Sequential Bias and System-Induced Surname Initial Disparity
7. Klein J, El LP (2003) Impairment of teacher efficiency during extended sessions of test correction. Eur J Teach Educ 26(3):379–392
8. Brackett MA, Floman JL, Ashton-James C, Cherkasskiy L, Salovey P (2013) The influence of teacher emotion on grading practices: a preliminary look at the evaluation of student writing. Teach Teach 19(6):634–646
9. Eckes T (2008) Rater types in writing performance assessments: A classification approach to rater variability. Lang Test 25(2):155–185
10. – (2021) Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts
11. American Educational Research Association, American Psychological Association, National Council on Measurement in Education (1985) Standards for educational and psychological testing
12. Kokalj E, Škrlj B, Lavrač N, Pollak S, Robnik-Šikonja M (2021) BERT meets Shapley: Extending SHAP Explanations to Transformer-based Classifiers. In: Toivonen H, Boggia M (Hrsg) Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation. Association for Computational Linguistics, Online, S 16–21
13. Goddard K, Roudsari A, Wyatt JC (2012) Automation bias: a systematic review of frequency, effect mediators, and mitigators. J Am Med Inform Assoc 19(1):121–127
14. Skitka LJ, Mosier KL, Burdick M (1999) Does automation bias decision-making? Int J Hum Comput Stud 51(5):991–1006
15. Loukina A, Madnani N, Zechner K (2019) The many dimensions of algorithmic fairness in educational applications. In: Yannakoudakis H, Kochmar E, Leacock C, Madnani N, Pilán I, Zesch T (Hrsg) Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Florence, Italy, S 1–10
16. Williamson DM, Xi X, Breyer FJ (2012) A Framework for Evaluation and Use of Automated Scoring. Educ Meas Issues Pract 31(1):2–13
17. Horbach A, Zesch T (2020) The Influence of Variance in Learner Answers on Automatic Content Scoring. Adv Technol-Based Assess Emerg Item Formats Test Des Data Sources
18. Zesch T, Horbach A, Zehner F (2023) To Score or Not to Score: Factors Influencing Performance and Feasibility of Automatic Content Scoring of Text Responses. Educ Meas Issues Pract 42(1):44–58
19. Bexte M, Horbach A, Zesch T (2021) Implicit Phenomena in Short-answer Scoring Data. In: Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language, S 11–19
20. Dzikovska M, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L et al (2013) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. Second Joint Conference on Lexical and Computational Semantics (*SEM), Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Bd. 2. Association for Computational Linguistics, Atlanta, Georgia, S 263–274
21. Gaddipati SK, Nair D, Plöger PG (2020) Comparative Evaluation of Pretrained Transfer Learning Models on Automatic Short Answer Grading. arXiv
22. Wu X, He X, Liu T, Liu N, Zhai X (2023) Matching Exemplar as Next Sentence Prediction (MeNSP): Zero-Shot Prompt Learning for Automatic Scoring in Science Education. In: Wang N, Rebolledo-Mendez G, Matsuda N, Santos OC, Dimitrova V (Hrsg) Artificial Intelligence in Education. Lecture Notes in Computer Science. Springer Nature Switzerland, Cham, S 401–413
23. Marvaniya S, Saha S, Dhamecha TI, Foltz P, Sindhgatta R, Sengupta B (2018) Creating Scoring Rubric from Representative Student Answers for Improved Short Answer Grading. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). Association for Computing Machinery, New York, NY, USA, S 993–1002
24. Bexte M, Horbach A, Zesch T (2022) Similarity-Based Content Scoring—How to Make S‑BERT Keep Up With BERT. In: Kochmar E, Burstein J, Horbach A, Laarmann-Quante R, Madnani N, Tack A et al (Hrsg) Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022). Association for Computational Linguistics, Seattle, Washington, S 118–123
25. Bexte M, Horbach A, Zesch T (2023) Similarity-based content scoring—A more classroom-suitable alternative to instance-based scoring? In: Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada
26. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, S 4171–4186
27. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P et al (2020) Language Models are Few-Shot Learners. Adv Neural Inf Process Syst 33:1877–1901
28. OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I et al (2023) GPT‑4 Technical Report. arXiv
29. Luccioni AS, Rogers A (2023) Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice. arXiv
30. Shermis MD (2014) State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assess Writ 20:53–76
31. Latif E, Zhai X (2024) Fine-tuning ChatGPT for automatic scoring. Comput Educ Artif Intell 6:100210
32. Bewersdorff A, Seßler K, Baur A, Kasneci E, Nerdel C (2023) Assessing student errors in experimentation using artificial intelligence and large language models: A comparative study with human raters. Comput Educ Artif Intell 5:100177
33. Chamieh I, Zesch T, Giebermann K (2024) LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches. In: Kochmar E, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, Tack A et al (Hrsg) Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). Association for Computational Linguistics, Mexico City, Mexico, S 309–315
34. Riordan B, Flor M, Pugh R (2019) How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models. In: Yannakoudakis H, Kochmar E, Leacock C, Madnani N, Pilán I, Zesch T (Hrsg) Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Florence, Italy, S 116–126
35. Horbach A, Ding Y, Zesch T (2017) The Influence of Spelling Errors on Content Scoring Performance. In: Tseng YH, Chen HH, Lee LH, Yu LC (Hrsg) Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017). Asian Federation of Natural Language Processing, Taipei, Taiwan, S 45–53
36. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W et al (2016) Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv
37. Erickson JA, Botelho A (2021) Is it Fair? Automated Open Response Grading. In: International Conference on Educational Data Mining
38. Ding Y, Riordan B, Horbach A, Cahill A, Zesch T (2020) Don’t take “nswvtnvakgxpm” for an answer—The surprising vulnerability of automatic content scoring systems to adversarial input. In: Scott D, Bel N, Zong C (Hrsg) Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain, S 882–892
39. Filighera A, Steuer T, Rensing C (2020) Fooling Automatic Short Answer Grading Systems. In: Bittencourt II, Cukurova M, Muldner K, Luckin R, Millán E (Hrsg) Artificial Intelligence in Education. Lecture Notes in Computer Science. Springer, Cham, S 177–190
40. Higgins D, Heilman M (2014) Managing What We Can Measure: Quantifying the Susceptibility of Automated Scoring Systems to Gaming Behavior. Educ Meas Issues Pract 33(3):36–46
41. Sclar M, Choi Y, Tsvetkov Y, Suhr A (2023) Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. In: The Twelfth International Conference on Learning Representations
42. Chefer H, Gur S, Wolf L (2021) Transformer Interpretability Beyond Attention Visualization, S 782–791
43. Suto I (2012) A Critical Review of Some Qualitative Research Methods Used to Explore Rater Cognition. Educ Meas Issues Pract 31(3):21–30
44. Patterson D, Gonzalez J, Le Q, Liang C, Munguia LM, Rothchild D et al (2021) Carbon Emissions and Large Neural Network Training. arXiv
45. Everman B, Villwock T, Chen D, Soto N, Zhang O, Zong Z (2023) Evaluating the Carbon Impact of Large Language Models at the Inference Stage. In: 2023 IEEE International Performance, Computing, and Communications Conference (IPCCC), S 150–157
46. Daumé H III (2007) Frustratingly Easy Domain Adaptation. In: Zaenen A, van den Bosch A (Hrsg) Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, S 256–263
47. Camus L, Filighera A (2020) Investigating Transformers for Automatic Short Answer Grading. In: Bittencourt II, Cukurova M, Muldner K, Luckin R, Millán E (Hrsg) Artificial Intelligence in Education. Lecture Notes in Computer Science. Springer, Cham, S 43–48
48. Funayama H, Asazuma Y, Matsubayashi Y, Mizumoto T, Inui K (2023) Reducing the Cost: Cross-Prompt Pre-finetuning for Short Answer Scoring. In: Wang N, Rebolledo-Mendez G, Matsuda N, Santos OC, Dimitrova V (Hrsg) Artificial Intelligence in Education. Lecture Notes in Computer Science. Springer Nature Switzerland, Cham, S 78–89
49. Horbach A, Pehlke J, Laarmann-Quante R, Ding Y (2023) Crosslingual Content Scoring in Five Languages Using Machine-Translation and Multilingual Transformer Models. Int J Artif Intell Educ
50. Horbach A, Stennmanns S, Zesch T (2018) Cross-Lingual Content Scoring. In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, New Orleans, Louisiana, S 410–419
51. Ding Y, Horbach A, Zesch T (2020) Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Suzhou, China, S 347–357
52. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F et al (2020) Unsupervised Cross-lingual Representation Learning at Scale. In: Jurafsky D, Chai J, Schluter N, Tetreault J (Hrsg) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, S 8440–8451