Open Access 2025 | Original Paper | Book Chapter

Formal Autograding in a Classroom

Authors: Dragana Milovančević, Mario Bucev, Marcin Wojnarowski, Samuel Chassot, Viktor Kunčak

Published in: Programming Languages and Systems

Publisher: Springer Nature Switzerland


Abstract

This chapter addresses the growing need for automated grading in programming courses, driven by rising student numbers and the limitations of traditional, exam-based assessment. It introduces the use of formal verification tools, in particular the Stainless verifier for Scala, to improve the accuracy and depth of automated feedback. The experiment, conducted at EPFL, shows how formal verification can reveal differences in program structure even when solutions exhibit identical input-output behaviour. This approach enables instructors to identify and discuss unique or suboptimal solutions, offering a richer learning experience. The chapter also discusses the challenges and lessons learned from deploying the Stainless verifier in a live classroom, including the importance of diverse reference solutions and the potential for continuously improving the grading system. The results highlight the potential of formal verification to enrich student feedback and to support high-value interactions between instructors and students, making it a valuable addition to automated assessment tools.

1 Introduction

With the growing numbers of students in programming courses, automated grading of programming assignments has become ubiquitous. The goal of practical and rigorous automated grading has motivated the development of a number of tools used in programming courses [1, 26, 37, 53, 58, 60, 66]. These tools commonly employ automated testing to assess the correctness of student submissions [40].
We observed this general trend in the functional programming course at EPFL. Whereas seven years ago the course counted around 200 students, this number has more than doubled since then. To maintain a reasonable workload for the teaching staff, the course heavily relies on testing to automate the assessment of programming assignments throughout the semester and for the final exam. EPFL also developed one of the first Coursera MOOCs in Europe [41], which also uses automated grading.
While highly automated, testing-based grading comes at the cost of accuracy and feedback quality. Researchers have shown that flaws in test suites and the consequential misclassifications of student solutions can have a negative impact on students [67]. Furthermore, testing-based graders fail to provide solution-specific feedback [11, 21] on both correct and incorrect solutions. Finally, in large programming courses, the use of fully automated grading tools can lead to a situation where most students never get to speak with instructors [24], relying solely on the impersonal feedback from automated tools.
Our work aims to provide deeper and more trustworthy automated feedback, while also allowing instructors to detect and analyse solutions that would benefit from further guidance. This approach would enable high-value interactions with students when needed without significantly increasing the instructors’ workload.
As an example, consider the student submissions from Figure 1, computing the gcd of two natural numbers. The submission from Figure 1c exhibits the same input-output behaviour as all the other submissions from Figure 1, but is notably less efficient. This submission would thus typically go unnoticed with a testing-based grader, even though it would be interesting to discuss this approach with students. Furthermore, while all the submissions from Figure 1 have identical input-output behaviour, we can identify two different recursion structures (subtraction-based vs. modulo-based Euclid’s algorithm). We would like our grader to classify submissions not only by correctness, but also based on the underlying program structure. Such motivation is also confirmed by the Rainfall study from functional programming education [17], which illustrates the variety in high-level structures of student submissions using manual analysis. Indeed, such classification allows instructors to quickly get an idea of the different approaches followed by students, and to detect those that would benefit from additional feedback.
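Since Figure 1 is not reproduced here, the following Scala sketch illustrates the two recursion schemas; it is only an illustration of the structural difference, not the actual student submissions.

  // Illustrative sketches of the two recursion schemas (not actual student code).
  // Subtraction-based Euclid's algorithm: recursion on a - b or b - a.
  def gcdSub(a: BigInt, b: BigInt): BigInt = {
    require(a > 0 && b > 0)
    if (a == b) a
    else if (a > b) gcdSub(a - b, b)
    else gcdSub(a, b - a)
  }

  // Modulo-based Euclid's algorithm: recursion on b and a % b.
  def gcdMod(a: BigInt, b: BigInt): BigInt = {
    require(a >= 0 && b >= 0)
    if (b == 0) a else gcdMod(b, a % b)
  }

Both variants compute the same gcd for positive inputs, but their recursion structures, and hence the induction schemas needed to prove them equivalent, differ.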
In this study, we propose the use of formal verification tools for automated grading and analysis of the space of student submissions. We consider functional induction as the strategy of choice for proving program correctness and recursion as the indicator of the underlying program structure. We use the Stainless verifier for Scala [29], which was previously evaluated on equivalence checking benchmarks, including programming assignments in an offline setting, translated from OCaml to Scala [42]. The theory and encouraging results of past programming languages research already suggest that verification tools can be used as grading assistants for program clustering [11, 42], but these do not include in-classroom deployment. In this paper, we report on our experiment using the Stainless verifier as a grading assistant in the live setting of an undergraduate programming course at EPFL. To encourage other researchers and educators to adopt verification tools in programming classrooms, we document our experience, describing what did and what did not work in practice.
The main contributions of this paper are as follows:
  • We report on our pilot experiment deploying the Stainless verifier as an automated grader for introductory exercises in our programming course. We describe our deployment process in detail and share our insights.
  • We show that formal verification can reveal variations in the underlying program structure of student solutions, such as different recursion schemas, even when solutions have identical input-output behaviour.
  • We publish a new data set with over 700 Scala programs, alongside the exercise material [43].
Fig. 1:
Three clusters of student submissions for the gcd exercise, illustrating different recursion schemas.
Fig. 2:
An incorrect submission from the drop exercise. Stainless detects a concrete counterexample, which we communicate as feedback to the student, and use on the instructor side by adding it to the test suite.

2 Background: Stainless Verifier

One differentiating characteristic of our study is the use of a formal verifier. Our course uses Scala [49], so we adopt the Stainless verifier [36]. Stainless can prove program termination and the absence of runtime errors such as division by zero. Developers can optionally provide program specification using contracts, such as pre- and post-conditions. Stainless supports only a subset of Scala, with best support for a purely functional subset with ML-style polymorphism.1 This subset largely aligns with the concepts that we present in our course.
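As a minimal sketch of what contracts look like (using only standard Scala require/ensuring syntax, which Stainless interprets as pre- and post-conditions), a contracted function might be written as follows; whether the solver discharges the generated conditions automatically depends on the property.

  // Minimal contract sketch: require is the precondition, ensuring the postcondition.
  // Stainless additionally checks termination and the absence of runtime errors
  // (BigInt arithmetic here sidesteps overflow concerns).
  def average(a: BigInt, b: BigInt): BigInt = {
    require(a >= 0 && b >= 0)                        // precondition
    (a + b) / 2
  }.ensuring(res => res >= 0 && res <= a + b)        // postcondition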
Stainless works by first parsing and type checking the input program using the Scala compiler to generate an abstract syntax tree (AST). Stainless then transforms this AST into an equivalent purely functional program, from which it generates verification conditions using a type checking algorithm [29]. It iteratively unfolds recursive functions [61, 64] and uses SMT solvers (Z3 [13], cvc5 [2], Princess [55]) to prove or disprove those verification conditions.
Our deployment makes use of a high-level functionality of Stainless to perform program equivalence checks via automated proofs by functional induction [42]. In this mode, rather than writing pre- and post-conditions, users provide specification in the form of a reference program. Stainless then attempts to prove program correctness via automated equivalence checks against the reference program. Our deployment also exploits the ability of Stainless to generate counterexamples for incorrect programs [64], providing feedback valuable to students and instructors (Figure 2).
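To make the equivalence-checking mode concrete, the sketch below shows a reference implementation and a candidate submission for the drop exercise, written against Stainless's purely functional list (stainless.collection.List); the names and project layout are illustrative. In this mode, Stainless attempts to prove that the two drop functions agree on all inputs by functional induction on their recursive structure, and reports a counterexample if they do not.

  import stainless.collection._   // Stainless's purely functional List (Cons / Nil)

  // Reference solution: drop every n-th element, counting forward from 1 to n.
  object Reference {
    def drop[T](l: List[T], n: BigInt): List[T] = {
      require(n > 0)
      def go(rest: List[T], i: BigInt): List[T] = rest match {
        case Nil()      => Nil[T]()
        case Cons(h, t) => if (i == n) go(t, 1) else Cons(h, go(t, i + 1))
      }
      go(l, 1)
    }
  }

  // Candidate submission: same behaviour, but counting backwards from n down to 1.
  object Candidate {
    def drop[T](l: List[T], n: BigInt): List[T] = {
      require(n > 0)
      def go(rest: List[T], k: BigInt): List[T] = rest match {
        case Nil()      => Nil[T]()
        case Cons(h, t) => if (k == 1) go(t, n) else Cons(h, go(t, k - 1))
      }
      go(l, n)
    }
  }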

3 The Experiment

In this section, we describe our experiment deploying a Stainless-backed formal autograder in a second-year undergraduate course that teaches software construction through functional programming to around 400 students.

3.1 Teaching Software Construction Using Scala

The goal of our Software Construction course is to teach functional programming and reasoning about programs, along with software engineering concepts and skills. This includes concepts such as subtyping, polymorphism, structural induction, (tail) recursion, as well as soft skills like debugging, reading and writing specifications, or using libraries. During the semester, students work on 12 homework assignments, designed to produce interesting or practical programs, including games as web applications, dynamical system simulations, and file system traversals. In addition to these graded projects, students work on exercise sets, consisting of smaller problems designed to help grasp the course material, such as evaluation of algebraic expressions, manual recursion elimination, and memoisation.

3.2 Experimental Setup

Our experiment took place during the fall semester of 2023. We prepared four autograded exercises in the form of optional individual short programming assignments to be solved and submitted on a computer. Table 1 describes our exercises. We deployed the exercises in the last week of the course, one month before the final exam. This decision was made in part to avoid interfering with other aspects of the course, which was given for the first time in this form. We provided a dedicated section of the course’s forum for questions and discussions about these optional exercises.
Students were invited to participate in the study by submitting their solutions, with an option to permit the public release of those submissions. The code was automatically graded by running tests and formal equivalence checks against our reference solution. We initially deployed one reference solution per exercise. During the experiment, we added another reference solution for the drop exercise, as discussed in Section 4. For each submission, students received automated feedback generated by our system. They were permitted to re-submit their solutions any number of times.
At the end of the course, we further examined the submissions using Stainless as a grading assistant on the instructor side. We gathered all the submissions proven correct, together with the reference solutions, and all the submissions that passed the tests but whose equivalence proof timed out against the reference solution. We then ran equivalence checking for each pair of programs in the entire set to identify and analyse clusters of provably equivalent submissions.
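A minimal sketch of how clusters can be derived once the pairwise results are available (assuming the verifier's verdicts have already been collected as a set of proven-equivalent pairs; the orchestration around Stainless itself is not shown):

  // Minimal union-find sketch: group submissions into clusters from the set of
  // pairs that the verifier proved equivalent. Submission IDs are illustrative.
  final class UnionFind(n: Int) {
    private val parent = Array.tabulate(n)(identity)
    def find(x: Int): Int =
      if (parent(x) == x) x
      else { parent(x) = find(parent(x)); parent(x) }
    def union(x: Int, y: Int): Unit = { parent(find(x)) = find(y) }
  }

  def clusters(numSubmissions: Int, provenEquivalent: Seq[(Int, Int)]): Map[Int, Seq[Int]] = {
    val uf = new UnionFind(numSubmissions)
    provenEquivalent.foreach { case (a, b) => uf.union(a, b) }
    (0 until numSubmissions).groupBy(uf.find)   // representative -> cluster members
  }

  // Example: 5 submissions; pairs (0,1) and (1,2) proven equivalent.
  // clusters(5, Seq((0, 1), (1, 2))) yields one cluster {0,1,2} and singletons {3}, {4}.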
Section 3.3 describes our deployment and Section 3.4 summarizes the results.
Table 1:
Four exercises deployed in our experiment. #P: number of submissions per exercise. #F and LOC: average number of functions and lines of code per submission, respectively.
Name    Description                            #P    #F    LOC
drop    Drop every n-th element from a list    373   1.8   13
gcd     Find the greatest common divisor        80   1.3   11
prime   Check if an integer is prime           220   3.7   19
infix   Implement infix boolean operators       46   7.8   17

3.3 Setting Up Stainless as a Grader

We next describe how we adapted the Stainless verifier to be used as a scalable on-premise grading service.
Grading Infrastructure. Our students use the Moodle educational platform to obtain and submit graded assignments. An internal Kubernetes service evaluates the submissions by running specified tests and uploading the grade and the feedback to Moodle. Naturally, we integrate our optional autograded exercises as Moodle assignments. This setup yields a simple process for the students, as there is no need to learn and use additional platforms. We use a custom Moodle plugin [7] to automatically run Stainless as a service upon each assignment submission and to report grades and feedback to students. We create a dedicated Docker image for each exercise, packaged with Stainless and the Z3 solver. An orchestrating script collects submissions, feeds them into Stainless along with the reference solutions, and produces feedback from the output of Stainless.
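A simplified sketch of the core step of such an orchestrating script is shown below; the stainless command-line entry point is assumed to be on the image's path, and the flags and output handling are placeholders rather than our exact deployment code.

  import scala.sys.process._
  import java.io.File

  // Simplified orchestration sketch: run Stainless on one submission together with
  // the reference solution(s) and capture its textual output for feedback generation.
  // Binary name and flags are placeholders, not the exact deployment invocation.
  def runStainless(submission: File, references: Seq[File], timeoutSeconds: Int): (Int, String) = {
    val out = new StringBuilder
    val logger = ProcessLogger(line => { out.append(line).append('\n'); () })
    val cmd = Seq("stainless") ++
      (submission +: references).map(_.getAbsolutePath) ++
      Seq(s"--timeout=$timeoutSeconds")            // placeholder flag
    val exitCode = cmd.!(logger)                   // blocks until Stainless finishes
    (exitCode, out.toString)                       // parsed downstream into grade + feedback
  }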
Exercise Setup. To illustrate the usability of our approach, we detail the steps needed to prepare each exercise:
1. Write the problem statement as a markdown file.
2. Set up a dedicated Scala project with the source files.
3. Write the reference solution(s).
4. Write unit tests for students to run their solutions locally.
5. Write a configuration file listing the name of the function(s) and the reference solution(s) for the equivalence checking.
6. Compile everything into a Docker image linked to Moodle.
Here, only Step 5 is specific to formal equivalence checks. The other steps were already in place for the existing testing-based grader.
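As an example of Step 4, the handout could include a plain assertion-based test file that students run locally with the Scala compiler alone. The Solution object, its drop implementation, and the expected values below are illustrative placeholders, shown with Scala's standard List for brevity.

  // Sketch of a local test file for the drop exercise (Step 4). Plain assertions
  // keep it runnable without extra libraries; the actual handout tests may differ.
  object Solution {
    // Placeholder implementation so the sketch compiles; students replace this.
    def drop[T](l: List[T], n: Int): List[T] =
      l.zipWithIndex.collect { case (x, i) if (i + 1) % n != 0 => x }
  }

  object DropLocalTests {
    def main(args: Array[String]): Unit = {
      assert(Solution.drop(List(1, 2, 3, 4, 5, 6), 2) == List(1, 3, 5))   // drop every 2nd
      assert(Solution.drop(List(1, 2, 3), 5) == List(1, 2, 3))            // n beyond length
      assert(Solution.drop(List.empty[Int], 3) == List.empty[Int])        // empty list
      println("All local tests passed.")
    }
  }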
Fig. 3:
Data collection pipeline.
Feedback Generation. For each submission, students receive a numerical grade, a feedback file with comments for each function, and a log file disclosing the detailed output of Stainless. The feedback file either contains a congratulation message, in case the submission is proven correct, or otherwise an explanation of the encountered error(s), such as in our example shown in Figure 2. We report custom feedback for the following errors, allowing students to iteratively improve their code until they solve each problem (a sketch of the corresponding feedback mapping follows the list):
  • User errors – such as an incorrectly named file or an incorrect function signature
  • Safety errors – such as division by zero or integer overflow
  • Counterexample errors – a counterexample input was found, proving that the submission is incorrect
  • Termination errors – a function could not be proven terminating within the specified timeout for SMT queries
  • Equivalence errors – a function could not be proven equivalent to any reference solution within the specified timeout for SMT queries
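The sketch below shows how such categories could be mapped to student-facing messages; the Outcome type and the wording are illustrative, not our exact templates or the parsing of Stainless's output.

  // Sketch of mapping verification outcomes to student-facing feedback.
  // The outcome type and messages are illustrative, not the exact deployment code.
  sealed trait Outcome
  case class UserError(detail: String)        extends Outcome
  case class SafetyError(kind: String)        extends Outcome
  case class Counterexample(input: String)    extends Outcome
  case object TerminationTimeout              extends Outcome
  case object EquivalenceTimeout              extends Outcome
  case object ProvenCorrect                   extends Outcome

  def feedback(functionName: String, o: Outcome): String = o match {
    case ProvenCorrect       => s"$functionName: proven equivalent to a reference solution. Well done!"
    case UserError(d)        => s"$functionName: submission could not be processed ($d). Check the file name and function signature."
    case SafetyError(k)      => s"$functionName: possible runtime error detected ($k)."
    case Counterexample(in)  => s"$functionName: incorrect on input $in."
    case TerminationTimeout  => s"$functionName: termination could not be proven within the time limit."
    case EquivalenceTimeout  => s"$functionName: could not be proven equivalent to a reference solution within the time limit."
  }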
Data Collection. Figure 3 shows our data collection pipeline. Upon each submission, the data is directly anonymized and copied out. Each submission is initially identified by a UUID, also included on Moodle to create a one-way link, allowing us to remove collected samples in case a student opts out of participation.

3.4 Results

Table 1 shows the total number of submissions per exercise, along with the average number of functions per submission and average number of lines of code per submission. Our data set comprising 709 Scala programs is publicly available under a permissive licence [43]. We hope that the data set will be useful to evaluate future research on programming education, which lacks public data sets.2
We export the generated feedback and present the results per exercise in Table 2. Out of the 400 students taking the course, 201 students agreed to participate in the study. Several students actively engaged in discussions on the course forum, with over 70 posts total. While solving the exercises, some students submitted many attempts, and some students skipped some exercises, resulting in a total of 719 submissions. After removing byte-identical files, we were left with 709 submissions. Around one third of the submissions are not of interest due to compilation errors (Column S), resulting in 480 syntactically valid programs.
Table 2:
Results of our experiment. S indicates submissions that could not be processed due to compilation errors, or due to students submitting wrong files. TS indicates submissions where Stainless could not prove safety checks. I and C show numbers of submissions proven incorrect and correct, respectively. TO indicates submissions where the equivalence check times out, left for clustering analysis. The last two columns show numbers of non-singleton and singleton clusters.
Name    S     TS    I     C     TO    Non-Singleton   Singleton
drop    60    56    169   14    70    10              26
gcd     21    2     8     3     42    2               12
prime   146   9     40    1     22    4               11
infix   2     0     2     42    0     1               0
Column I shows the number of incorrect submissions, evidenced by a counterexample (46%). Column C shows the number of submissions that our verifier proved equivalent to the reference solution, and therefore correct (13%). The remaining submissions are neither provably correct nor provably incorrect (42%), and are left for further manual inspection and clustering analysis.
The last two columns show the results of our clustering analysis. Column “Non-Singleton” refers to the number of clusters with two or more equivalent submissions. Column “Singleton” denotes the number of submissions that are not provably equivalent to any other submission, and thus form a singleton cluster of their own.

4 Discussion and Lessons Learned

In this section, we interpret the results of our experiment and summarize our takeaways. Figure 4 depicts the clusters of submissions, where nodes represent submissions and edges represent direct equivalence proofs between submissions.
Fig. 4:
Non-singleton clusters of submissions originally classified as correct or timed-out for all four exercises. Nodes represent submissions and edges represent direct equivalence proofs. The accompanying artifact contains .gexf files allowing further interactive inspection and graph manipulation via open-source tools such as https://gephi.org/gephi-lite/.
Solution Variations. We find that the verifier’s success rate is lower than in the evaluation in an offline setting [42], with the majority of correct submissions timing out. We believe that this difference is due to the scarcity of reference solutions: in our experiment, we only provided one reference solution per exercise. Based on student feedback, this issue already became apparent during the semester. We addressed this problem by adding another reference solution for the drop exercise midway through the experiment, which improved the success rate.
Figure 4a shows the non-singleton clusters of submissions for the drop exercise. Our initial reference solution (node R) is in a cluster of size 7. Our additional reference solution (node R’) is in another cluster of size 17. The main difference between the two clusters is how they count to each n-th element: solutions in the smaller cluster count backwards from n down to 1, while solutions in the larger cluster count forward from 1 to n. Another variation of the algorithm counts forward until the end of the list, computing each time the counter modulo n. This variation is problematic for potential overflows and for termination checks, although in practice we can assume that the size of the input list is smaller than Int.MaxValue.
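A sketch of this modulo-counting variation (illustrative, not actual student code): with a fixed-width Int index the ever-growing counter is where overflow concerns arise, and termination needs a measure on the list rather than on the index.

  import stainless.collection._

  // Modulo-counting variation: the index i keeps growing until the end of the list,
  // and an element is dropped whenever i % n == 0.
  def dropModulo[T](l: List[T], n: BigInt): List[T] = {
    require(n > 0)
    def go(rest: List[T], i: BigInt): List[T] = rest match {
      case Nil()      => Nil[T]()
      case Cons(h, t) => if (i % n == 0) go(t, i + 1) else Cons(h, go(t, i + 1))
    }
    go(l, 1)
  }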
We next consider the non-singleton clusters of submissions for the gcd exercise (Figure 4b). Programs from the smaller cluster (7 programs, 2 of which are shown in Figure 1a) use the subtraction-based Euclid’s algorithm. Programs from the larger cluster (27 programs, 4 of which are shown in Figure 1b) use the modulo-based variation. Both clusters correspond to valid solutions for the gcd exercise. Furthermore, neither approach can be considered strictly better than the other. Takeaway: Even in introductory exercises, a single reference solution is not sufficient to capture the variety in student submissions. One should aim to provide a diverse set of reference solutions to reflect the diversity of student submissions.
Unique Solutions. Some submissions have a unique recursive structure and thus form a cluster of their own (column “Singleton” in Table 2). In the Ask-Elle studies [21], the authors devote particular attention to submissions that do not get matched against any reference solution and belong to a separate “correct (but no match)” category. Singleton clusters also appear in the Rainfall study [17], where they end up in a dedicated “other/unclear composition” category. We inspect the singleton clusters of submissions for the gcd exercise and identify the underlying causes for this classification:
  • unique solution, identified by its unique recursion schema (Figure 1c)
  • limitations of equivalence checking in Stainless (Figure 5a)
  • limitations of formal equivalence checking (Figure 5b)
Although these submissions pass all the tests, upon manual inspection we were able to identify suboptimal implementations among them. In introductory programming courses, such submissions could nevertheless be awarded the maximum grade, but they are still worthy of custom feedback. On the other hand, in topics such as data science, where performance is of interest, such submissions should preferably be discouraged and awarded only partial points [57]. Takeaway: By failing to automatically classify certain supposedly correct submissions, the grader can reveal the submissions that require further manual inspection, and help detect suboptimal solutions.
Fig. 5:
Singleton clusters from the gcd exercise, pointing to submissions that require further manual inspection.
The Role of Instructors. Our grader helped us identify not only unique solutions, but also whole clusters of suboptimal solutions. For example, consider the non-singleton clusters of submissions for the prime exercise (Figure 4c). Our reference solution (node R) checks if the input integer n is prime by dividing n by each positive integer up to the square root of n, or until a divisor is found. Interestingly, the most popular solution strategy among students turned out to be checking all the way to n instead.3 Stainless could not prove this strategy equivalent to our reference solution, forming a separate cluster of size 7. Takeaway: Every submission that results in a timeout (passes the tests but is not provably correct) should be checked by instructors. With the help of clustering, the number of manual checks reduces to one representative submission check per cluster. Upon manual inspection, correct submissions can be promoted to reference solutions. This way, the set of tests and reference solutions grows over time, to the point where only new singleton submissions are checked manually.
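For illustration, the two strategies might look as follows; these sketches are not the actual reference solution or student submissions.

  // Reference-style strategy: try divisors d while d * d <= n.
  def isPrimeSqrt(n: BigInt): Boolean = {
    def noDivisor(d: BigInt): Boolean =
      if (d * d > n) true
      else if (n % d == 0) false
      else noDivisor(d + 1)
    n >= 2 && noDivisor(2)
  }

  // Popular student strategy: try every divisor up to n - 1.
  def isPrimeLinear(n: BigInt): Boolean = {
    def noDivisor(d: BigInt): Boolean =
      if (d >= n) true
      else if (n % d == 0) false
      else noDivisor(d + 1)
    n >= 2 && noDivisor(2)
  }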
Library Functions. To tame the disparity between the Scala List class and the Stainless List class4, we provide a stripped version of Stainless List with the handout, asking the students to use this version instead of Scala’s. This way, we were able to exploit the benefits of formal verification without discouraging students from using library functions. For example, in the drop exercise, we observe submissions using library functions such as foldLeft, length, and size, each forming a separate cluster. One cluster of size two contains tail-recursive programs that use list concatenations. Takeaway: Some grading assistants introduce restrictions on using library functions [11]. Yet, support for library functions is crucial, as it allows exercising functional programming abstractions. Library functions also lead to more diversity in student submissions.
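As an illustration of such a library-function-based submission, a foldLeft variant of drop might look as follows (shown with Scala's standard List for brevity; the stripped handout list offers similar operations).

  // Sketch of a library-function-based submission: drop every n-th element by
  // folding over the list while tracking the current position.
  def dropFold[T](l: List[T], n: Int): List[T] = {
    require(n > 0)
    val (kept, _) = l.foldLeft((List.empty[T], 1)) {
      case ((acc, i), x) => (if (i % n == 0) acc else x :: acc, i + 1)
    }
    kept.reverse   // elements were prepended, so restore the original order
  }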
Rigorous Autograding. Our students receive feedback for each (re)submission, allowing them to make progress from safety errors to logical errors, and finally to error-free programs. This descriptive feedback was well received by the students. Several students reported on the course forum that the grader found bugs in their code that they did not detect locally by running the test suite. For example, in the gcd exercise, 7 out of 8 incorrect submissions are due to safety errors in auxiliary functions that could not be detected by the test suite. The authors of the LAV verifier make similar observations [65]. In their evaluation on C programming assignments, they show that, out of 266 submissions, LAV found 35 incorrect submissions that successfully passed the test suite, mainly due to subtle buffer overflows.
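An illustrative example of this kind of safety error (not actual student code): an auxiliary function with a possible division by zero that a small hand-written test suite may never exercise, but which Stainless reports because it checks each function under its own precondition.

  // Auxiliary function without a precondition: a % b fails when b == 0.
  def step(a: Int, b: Int): Int = a % b                 // flagged: possible division by zero

  def gcdUnsafe(a: Int, b: Int): Int =
    if (b == 0) a
    else {
      val r = step(a, b)                                // safe at this call site (b != 0) ...
      if (r == 0) b else gcdUnsafe(b, r)                // ... but step itself is not proven safe
    }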
Whereas it is clear that programming assignments are not safety-critical on their own, teaching programming means teaching potential future authors of safety-critical software. Our study advocates for rigorous autograding, in a way that is completely transparent for students (push-button verification [44, 48, 62]). Namely, students are encouraged to think rigorously about program correctness, without the need for any in-depth knowledge about formal methods. This setting is in line with the vision of formal methods thinking in computer science education put forward by B. Dongol et al. [14]. Takeaway: Formal verification enriches the feedback given to students. It provides feedback for errors that prevent correctness even if they cannot be formulated easily as counterexamples to program safety.

5 Limitations and Threats to Validity

In this section, we consider some limitations of our approach and our study.
Compilation Errors. We attribute the large number of compilation errors to students only having local access to the Scala compiler, and not to Stainless.5 To reduce the number of compilation errors, we should allow students to run Stainless on their local machines. In each problem statement, we should clearly specify which language constructs are supported and which ones are forbidden. Ideally, we should provide local access to Stainless without safety and termination checks. This point is supported by previous research on the role of feedback, which suggests that introducing some form of delay is better for learning purposes, as immediate feedback is prone to undesirable trial-and-error solving strategies [9]. We notice similar concerns in a related report on the Learn-OCaml web platform [27], which poses the question whether limiting the number of resubmissions would reduce the number of trivial and syntactic errors.
Syntax and Style. While it would be possible to further incorporate techniques for syntax and style checking [27, 30], in this report, we focus on semantic program structure defined by recursion schemas. For example, the infix exercise is non-recursive, and therefore all the submissions are in the same cluster, despite students using a mixture of if-then-else, pattern matching, built-in boolean and bitwise operators, and custom (non-recursive) boolean functions. Furthermore, program clustering by recursion schemas in some cases does not differentiate purely functional programs from programs with loops and mutations. The reason is that Stainless automatically transforms loops into tail recursive functions during verification [5]. For example, submission gcdY with a while loop and variable mutation6 is proven equivalent to purely functional submission gcdW, and thus placed in the same cluster (Figure 1b).
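For illustration, submissions in the spirit of gcdY and gcdW might look as follows (sketches only; the actual submissions are in the published data set [43]).

  // Imperative style: while loop with variable mutation. Stainless rewrites the loop
  // into a tail-recursive function during verification, so this lands in the same
  // cluster as the purely functional version below.
  def gcdY(x: BigInt, y: BigInt): BigInt = {
    require(x >= 0 && y >= 0)
    var a = x
    var b = y
    while (b != 0) {
      val t = a % b
      a = b
      b = t
    }
    a
  }

  // Purely functional, modulo-based recursion.
  def gcdW(a: BigInt, b: BigInt): BigInt = {
    require(a >= 0 && b >= 0)
    if (b == 0) a else gcdW(b, a % b)
  }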
To address those limitations, future research could consider a more fine-grained clustering to differentiate recursion and loops, by performing syntax checks complementary to inductive equivalence proofs. Similar checks are already used by MOOC graders [27, 41].
General Approach. Our analysis only uses data from 709 submissions from one course at one university, and does not necessarily represent the general trend in programming education. We found this setting (200 students, 4 exercises, 15 LOC on average) to be a reasonable size for a pilot study, which enabled us to identify interesting aspects of in-class deployment.
In larger assignments, there is often a clear decomposition into smaller functions, guided by the problem statement or a scaffold. Modular equivalence checkers, such as Stainless, should naturally scale to large assignments that consist of several small functions.
While our study uses the Scala language and the Stainless verifier, our approach applies to other languages and verifiers. In particular, equivalence proofs by functional induction are applicable to recursive programs in general [39] (notable examples include functional induction in Coq [32], recursion induction [47] or computation induction [46] in Isabelle, default induction heuristic in ACL2 [34]). Furthermore, tools such as REVE [15], RVT [23] or SymDiff [35] have shown that equivalence checking is also feasible for imperative programs, beyond the examples that we presented in our study.
6 Related Work

In this section, we describe related work on functional programming education, state-of-the-art clustering-based grading assistants, and applications of formal verification tools in programming courses.
Functional Programming Education. The Rainfall problem [59], originally studied in the field of imperative programming education, has recently become an insightful benchmark in functional programming education. In [17], the author takes over 200 solutions to the Rainfall problem across five functional-first courses and manually splits the solutions based on structure. The Rainfall studies later inspired evaluation of techniques for scaling program classification using machine learning, assuming that an instructor has indicated categories of interest [12]. In contrast, our approach automatically discovers clusters of submissions using equivalence proofs.
Learn-OCaml is an online grading platform for the OCaml MOOC [8]. Researchers have proposed extensions to the platform allowing assessment of style and test quality [27], as well as understanding how students interact with the grader [20]. Learn-OCaml’s ability to keep track of metrics such as grades, the number of syntax errors, and the time spent on each question enables clustering of students into four fixed interaction strategies. In contrast, we consider clustering of submissions, and dynamic discovery of clusters. It would be interesting to combine techniques from Learn-OCaml with our approach and relate interaction strategies to underlying program structures.
Ask-Elle [21] is an online tutor for introductory Haskell exercises, providing feedback and incremental hints using property-based testing and strategy-based tracing. Like in our experiment, the case studies on Ask-Elle observe different patterns in student submissions, and show how Ask-Elle benefits from having multiple reference solutions that provide strategy-specific guidance. Custom feedback in Ask-Elle and Learn-OCaml comes at a cost of significant manual effort to set up a grader. In Ask-Elle, the instructor provides annotations for reference solutions and manually specifies QuickCheck [10] properties. Similarly, for each new Learn-OCaml exercise, the instructor has to specify custom syntax checks, predict unusual solutions, and write mutants to evaluate student-written tests. In contrast, in our deployment, we only had to provide a reference solution for each new exercise, along with optional unit tests, leaving time for instructors to discuss specific solutions with students.
Program Clustering in Grading Assistants. The idea of using equivalence checking to detect algorithmic similarity was previously explored in the ZEUS grading assistant [11]. Like our grader, ZEUS also relies on SMT solving, but while we use functional induction, ZEUS uses inference rules that simulate relationships between expressions. ZEUS is thus more restricted with respect to recursion and introduces limitations for programs with library functions. While both [11] and [42] focus on empirical evaluation of equivalence checking, the focus of our analysis is on understanding the resulting clusters and corresponding program structures.
OverCode [22] is a grading assistant for large scale courses, providing an interactive user interface for visualizing clusters of solutions. OverCode supports manual manipulation of program clusters, e.g., merging clusters by adding rewrite rules. However, unlike our approach, OverCode performs neither automated testing nor verification to check for program correctness. Neither OverCode nor ZEUS provide counterexamples for incorrect programs. Complementing our approach with OverCode’s user interface and merge rules could be beneficial to ease manipulation of program clusters and counterexamples.
Verification Tools as Grading Assistants. LEGenT [1] is a tool for personalized feedback generation, using Clara [26] for program clustering and REVE [15] to identify provably correct submissions. LAV [65] is a verification tool evaluated on imperative programming assignments. Unlike our approach, which supports recursion, both LEGenT and LAV are targeting non-recursive programs.
Dracula [50, 51, 63] combines the ACL2 theorem prover [34] with the DrScheme graphical interface [16] in introductory programming and software engineering courses. The authors have used Dracula in undergraduate courses on functional programming and software engineering, as is the case in our study. The main difference is that, in their setting, the goal is to teach students to formulate theorems (programming and proving). In contrast, in our approach, we do not go as far as asking students to systematically prove program properties, or even state them. Instead, in our study, we provide a setting that encourages students to think rigorously about program correctness, without any in-depth knowledge about formal methods. Furthermore, unlike our approach, Dracula does not perform program clustering. Both ACL2 and Stainless have support for automated functional induction, which suggests that it would be possible to perform a similar study in ACL2, even though ACL2 is not higher-order.
Recently, researchers have increasingly shared their experience on using proof assistants for teaching [3], both for mathematics and computer science programs. Proof assistants have been finding their way into specialized graduate courses [33, 45, 52], into undergraduate courses [18, 19, 31, 38, 54], and even into high schools [4, 25]. The question remains whether theorem provers can offer a sufficiently high-level interface for students to write proofs without having to learn the proof assistant itself.

7 Conclusions and Future Work

We have reported our experience in using a formal verifier for evaluation of assignments in an undergraduate programming course. We found that formal verification enriches the feedback given to students. Moreover, verification based on functional induction allowed us to differentiate between solutions, even when solutions exhibit the same input-output behaviour. It allowed us to propose additional reference solutions, and to focus our attention on unusual solutions. We are therefore confident that this approach represents a useful addition to automating assignment evaluation.
We have shared our main takeaways and our deployment process in detail, with the hope that our study will provide inspiration for others trying to incorporate program verifiers and program clustering into assignment assessment in their own courses. In the future, we plan to progressively use our approach on more exercises in our course. To improve the quality of feedback reported to students, we will continuously grow the set of reference solutions and provide more refined verification outcome summaries. In the long run, future research could also consider metrics to explicitly cluster solutions by efficiency. Such a feature would be particularly desirable in competitive programming and algorithms courses [57, 68].

Availability of Data and Software

The complete data set from this paper is available in an open access Zenodo repository [43].

Disclosure of Interests.

The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Footnotes
1. Stainless also supports imperative programs [6, 28, 56].
2. In a recent systematic review [40], the authors analyse 121 research papers in the field and remark that, indeed, only 10 of them have publicly available data sets. We are grateful to EPFL’s human research ethics committee and colleagues for their help throughout this study and to our students for allowing us to share their submissions. The overall process involved a significant administrative overhead, which may partly explain the scarcity of public data sets in the literature on programming education.
3. This was also the case for the majority of submissions discarded due to compilation errors, implementing the same technique using for-comprehensions.
4. The implementation of the List class in the Scala library internally mutates the tail for efficiency. To facilitate verification, the Stainless library provides two simpler implementations (an invariant and a covariant list).
5. Stainless only supports a subset of the Scala language (Section 2).
6. Stainless supports a limited form of side effects, such as mutation to mostly non-aliased state, and internally handles loops by transformations to recursive functions.
References
2. Barbosa, H., Barrett, C., Brain, M., Kremer, G., Lachnitt, H., Mann, M., Mohamed, A., Mohamed, M., Niemetz, A., Nötzli, A., Ozdemir, A., Preiner, M., Reynolds, A., Sheng, Y., Tinelli, C., Zohar, Y.: cvc5: A Versatile and Industrial-Strength SMT Solver. In: Tools and Algorithms for the Construction and Analysis of Systems. pp. 415–442 (2022). https://doi.org/10.1007/978-3-030-99524-9_24
3. Bartzia, E., Meyer, A., Narboux, J.: Proof assistants for undergraduate mathematics and computer science education: elements of a priori analysis. In: Trigueros, M. (ed.) INDRUM 2022: Fourth conference of the International Network for Didactic Research in University Mathematics. Reinhard Hochmuth, HAL, Hanovre, Germany (Oct 2022), https://hal.science/hal-03648357
9. Chevalier, M., Giang, C., El-Hamamsy, L., Bonnet, E., Papaspyros, V., Pellet, J.P., Audrin, C., Romero, M., Baumberger, B., Mondada, F.: The role of feedback and guidance as intervention methods to foster computational thinking in educational robotics learning activities for primary school. Computers & Education 180, 104431 (2022). https://doi.org/10.1016/j.compedu.2022.104431
10. Claessen, K., Hughes, J.: QuickCheck: a lightweight tool for random testing of Haskell programs. In: Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming. pp. 268–279. ICFP ’00 (2000). https://doi.org/10.1145/351240.351266
13. De Moura, L., Bjørner, N.: Z3: An Efficient SMT Solver. In: Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems. pp. 337–340. TACAS’08/ETAPS’08 (2008). https://doi.org/10.1007/978-3-540-78800-3_24
15. Felsing, D., Grebing, S., Klebanov, V., Rümmer, P., Ulbrich, M.: Automating Regression Verification. In: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. pp. 349–360. ASE ’14 (2014). https://doi.org/10.1145/2642937.2642987
18. From, A., Jacobsen, F., Villadsen, J.: SeCaV: A sequent calculus verifier in Isabelle/HOL. In: Proceedings of 16th Logical and Semantic Frameworks with Applications. Electronic Proceedings in Theoretical Computer Science, EPTCS, vol. 357, pp. 38–55 (2022). https://doi.org/10.4204/EPTCS.357.4
19. Gambhir, S., Guilloud, S., Milovančević, D., Rümmer, P., Kunčak, V.: Lisa tool integration and education plans (2023)
20. Geng, C., Xu, W., Xu, Y., Pientka, B., Si, X.: Identifying Different Student Clusters in Functional Programming Assignments: From Quick Learners to Struggling Students. In: Proceedings of the 54th ACM Technical Symposium on Computer Science Education. pp. 750–756. SIGCSE 2023 (2023). https://doi.org/10.1145/3545945.3569882
24. Griswold, W.G.: Experience Report: Meet the Professor - A Large-Course Intervention for Increasing Rapport. In: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1. pp. 415–421. SIGCSE 2024 (2024). https://doi.org/10.1145/3626252.3630844
28. Hamza, J., Felix, S., Kunčak, V., Nussbaumer, I., Schramka, F.: From Verified Scala to STIX File System Embedded Code Using Stainless. In: Deshmukh, J.V., Havelund, K., Perez, I. (eds.) NASA Formal Methods. pp. 393–410. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-031-06773-0_21
30. Hart, R., Hays, B., McMillin, C., Rezig, E.K., Rodriguez-Rivera, G., Turkstra, J.A.: Eastwood-Tidy: C Linting for Automated Code Style Assessment in Programming Courses. In: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. pp. 799–805. SIGCSE 2023 (2023). https://doi.org/10.1145/3545945.3569817
35. Lahiri, S.K., Hawblitzel, C., Kawaguchi, M., Rebêlo, H.: SYMDIFF: A Language-Agnostic Semantic Diff Tool for Imperative Programs. In: Proceedings of the 24th International Conference on Computer Aided Verification. pp. 712–717. CAV’12, Springer-Verlag, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31424-7_54
37. Lee, J., Song, D., So, S., Oh, H.: Automatic Diagnosis and Correction of Logical Errors for Functional Programming Assignments. OOPSLA, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3276528
38. Maxim, H., Kaliszyk, C., van Raamsdonk, F., Wiedijk, F.: Teaching logic using a state-of-art proof assistant. Acta Didactica Napocensia (2010)
39.
41. Miller, H., Haller, P., Rytz, L., Odersky, M.: Functional programming for all! Scaling a MOOC for students and professionals alike. In: ICSE Companion. pp. 256–263. ACM (2014)
44. Nelson, L., Sigurbjarnarson, H., Zhang, K., Johnson, D., Bornholt, J., Torlak, E., Wang, X.: Hyperkernel: Push-Button Verification of an OS Kernel. In: Proceedings of the 26th Symposium on Operating Systems Principles. pp. 252–269. SOSP ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3132747.3132748
45. Nipkow, T.: Teaching Semantics with a Proof Assistant: No More LSD Trip Proofs. In: Kuncak, V., Rybalchenko, A. (eds.) Verification, Model Checking, and Abstract Interpretation. pp. 24–38. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
48. Oberhauser, J., Chehab, R.L.d.L., Behrens, D., Fu, M., Paolillo, A., Oberhauser, L., Bhat, K., Wen, Y., Chen, H., Kim, J., Vafeiadis, V.: VSync: push-button verification and optimization for synchronization primitives on weak memory models. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. pp. 530–545. ASPLOS ’21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3445814.3446748
51. Page, R., Eastlund, C., Felleisen, M.: Functional programming and theorem proving for undergraduates: a progress report. In: Proceedings of the 2008 International Workshop on Functional and Declarative Programming in Education. pp. 21–30. FDPE ’08 (2008). https://doi.org/10.1145/1411260.1411264
53. Pu, Y., Narasimhan, K., Solar-Lezama, A., Barzilay, R.: sk_p: A neural program corrector for MOOCs. In: Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity. pp. 39–40. SPLASH Companion 2016 (2016). https://doi.org/10.1145/2984043.2989222
54. Rousselin, P.: Mathematics with Coq for first-year undergraduate students (2023)
56. Schmid, G.S., Kuncak, V.: Generalized arrays for Stainless frames. In: Finkbeiner, B., Wies, T. (eds.) Verification, Model Checking, and Abstract Interpretation - 23rd International Conference, VMCAI 2022, Philadelphia, PA, USA, January 16-18, 2022, Proceedings. Lecture Notes in Computer Science, vol. 13182, pp. 332–354. Springer (2022). https://doi.org/10.1007/978-3-030-94583-1_17
57. Singh, A., Fariha, A., Brooks, C., Soares, G., Henley, A.Z., Tiwari, A., M, C., Choi, H., Gulwani, S.: Investigating student mistakes in introductory data science programming. In: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1. pp. 1258–1264. SIGCSE 2024 (2024). https://doi.org/10.1145/3626252.3630884
60.
62. Tao, R., Shi, Y., Yao, J., Li, X., Javadi-Abhari, A., Cross, A.W., Chong, F.T., Gu, R.: Giallar: push-button verification for the Qiskit quantum compiler. In: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. pp. 641–656. PLDI 2022, Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3519939.3523431
66. Wang, K., Singh, R., Su, Z.: Search, align, and repair: Data-driven feedback generation for introductory programming exercises. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. pp. 481–495. PLDI 2018, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3192366.3192384
67.
Metadata
Title: Formal Autograding in a Classroom
Authors: Dragana Milovančević, Mario Bucev, Marcin Wojnarowski, Samuel Chassot, Viktor Kunčak
Copyright year: 2025
DOI: https://doi.org/10.1007/978-3-031-91121-7_7