
Open Access 10.08.2023 | Regular contribution

Simulating all archetypes of SQL injection vulnerability exploitation using reinforcement learning agents

Authors: Åvald Åslaugson Sommervoll, László Erdődi, Fabio Massimo Zennaro

Published in: International Journal of Information Security | Issue 1/2024


Abstract

Vulnerabilities such as SQL injection represent a serious challenge to security. While tools with a pre-defined logic are commonly used in the field of penetration testing, the continually evolving nature of the security challenge calls for models able to learn autonomously from experience. In this paper we build on previous results on the development of reinforcement learning models devised to exploit specific forms of SQL injection, and we design agents that are able to tackle a varied range of SQL injection vulnerabilities, virtually comprising all the archetypes normally considered by experts. We show that our agents, trained on a synthetic environment, perform a transfer of learning among the different SQL injections challenges; in particular, they learn to use their queries to efficiently gain knowledge about multiple vulnerabilities at once. We also introduce a novel and more versatile way to interpret server messages that reduces reliance on expert inputs. Our simulations show the feasibility of our approach which easily deals with a number of homogeneous challenges, as well as some of its limitations when presented with problems having higher degrees of uncertainty.
Notes
László Erdődi and Fabio Massimo Zennaro have contributed equally to this work.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

SQL injection (SQLi) is a complex and well-known type of web vulnerability exploitation. It is ranked number three in the OWASP Web Application Security Risks [1]. In 2020, 462 new SQLi vulnerabilities were registered in the Common Vulnerabilities and Exposures (CVE) database [2], and 738 in 2021; as of May 25, 2022, a further 626 products with SQLi vulnerabilities had been recorded, and extrapolating this trajectory would give roughly 1575 new registered vulnerabilities by the end of that year. These numbers make clear that SQLi vulnerabilities are increasing year over year. Experts often test or investigate products before or even during release to avoid such unintended vulnerabilities. This verification is sometimes carried out with the aid of automated tools, such as SQLmap [3]. However, these tools mainly detect specific or low-complexity vulnerabilities, as they check for well-defined strategies and test predefined requests for exploitation. Indeed, automation of the exploitation is far from being solved, and failure to deal with more complex forms of exploitation highlights the current limits of available tools. The best example of unsuccessful exploitation automation is capture-the-flag (CTF) style SQLi challenges; many of these challenges can be solved with tools like SQLmap only after a human expert has given guidance to restrict the search process. Complex SQLi exploitation is nowadays a highly hybrid task involving manual vulnerability mapping by experts complemented by exploitation tools.
Machine learning (ML) offers an avenue to increase the autonomy of exploitation tools for vulnerability detection and prevention. ML has recently proved very effective in solving complex optimization tasks and in reproducing intelligent human behavior. As ML models evolve and computational resources increase, it is reasonable to wonder whether the human contribution to SQL exploitation could be replaced or improved (in terms of precision and speed) by such algorithms. ML models could be trained to mimic the behavior of an expert, learn from experience, infer and apply the best exploitation strategies to various problems. Reinforcement learning (RL) is a particularly suited paradigm for this type of challenge, and it has led to relevant breakthroughs in very challenging application domains, such as games [4] or biomedicine [5]. Promising applications have also been previously explored in the area of penetration testing, and SQLi exploitation [6, 7], showing that providing a RL agent with solid priors about the structure of the problem could lead to the successful solution of simple union-based SQLi problems. This paper applies a similar approach to solve a more realistic multi-task problem, in which a RL agent is not faced with a single type of SQLi vulnerability but with a range of related, yet different, vulnerabilities. Specifically, we consider:
1. no vulnerability
2. stack-based
3. union-based
4. boolean-based blind
5. error-based
6. time-based blind.
To instantiate these vulnerabilities, we rely on a simulated environment. Three simulations with increasing complexity are run to show how our agent learns and adapts in different environments and how the addition of certain vulnerability types impacts learning.
Our implementation of the RL model builds on previous models [6, 7] but improves on their realism by extending the range of challenges the agent has to deal with and by using inputs and outputs that require less external human processing. Our simulations show that our RL agent can successfully distinguish and exploit all the SQLi vulnerabilities considered. Additionally, the agent’s training illustrates its adaptability; its queries are sometimes capable of probing for a specific vulnerability while gaining general knowledge that will help with the other exploitations. Dealing with varying forms of SQLi vulnerabilities is something expert penetration testers do every day, and our work represents the first study endowing a RL agent with a similar skill for penetration testing.

1.1 Summary of contributions

This work further develops the contribution of Erdodi et al. [6], and Del Verme et al. [7], which only handled union-based vulnerabilities without considering other vulnerability types. Our main contributions are as follows. (1) We extend this approach to all the archetypes of SQL injection vulnerabilities. (2) In order to make the simulation more realistic, our work is the first not to assume that the simulated website is vulnerable. This forces the agent to investigate the website more thoroughly and come to a decision, as penetration testers would do, on whether the website is vulnerable to SQL injection exploitation or not. (3) We simplified the preprocessing. In previous works, human input was necessary to convert server responses to meaningful inputs for the agent. We show that our model can deal with the raw message by relying on features of the HTML response, such as its length or keywords. This also gives the agent higher control, potentially allowing future agents to make more interesting inferences and decisions. (4) Last, this work also shows that the agent exhibits a degree of transfer learning, meaning that it does not learn to solve each SQL vulnerability in isolation, but it can use queries to gain information about different types of SQL injection vulnerabilities. In conclusion, we showcase more performant RL agents that can tackle a wider variety of SQL injection-related problems.

1.2 Structure of the paper

The rest of the paper is organized as follows. Section 2 contains background information, as well as a review of the literature. Section 3 covers the details of our RL problem. After that, we list and discuss our results in Sect. 4. Finally, in Sect. 5, we conclude our study and list possible future work.

2 Background

In this section, we first review CTF challenges and discuss different types of SQLi; we then provide the basic concept underlying the RL modeling approach and cover related work in the literature.

2.1 Capture-the-flag challenges

In the world of ethical hacking and penetration testing, CTF challenges are often used to measure the skill and ability of an expert or a team of experts. A committee sets up a mock environment containing potential vulnerabilities, and one or more teams of experts are tasked with identifying the vulnerabilities and hacking them; a flag hidden behind each vulnerability is used as proof of a successful exploitation. Upon submission of a valid flag, a team is awarded points for completing the challenge. This setup is straightforward, flexible, and conforms well to a typical RL problem in which an agent interacts with an environment and receives rewards for completing its tasks.

2.2 SQL injection

Most websites are supported by a database, accessed based on some user input. For example, when using a login form, often there is a SQL query accessing the database to check the validity of the entered username and password. This query is typically hidden from the user. If this hidden query does not have proper input validation, the website has an SQLi vulnerability. In this case, a malicious user can craft an input that would allow her to alter the interpretation of the SQL query in order to execute customized requests and potentially obtain protected information.
Discovering and exploiting a SQLi vulnerability is hard since the attacker has limited information on what is happening server-side. A first step usually includes an exploratory phase during which the attacker sends requests and analyzes responses in order to learn more about the target system, for instance:
  • how the input data is placed into the hidden SQL query,
  • what types of inputs are permitted by the website,
  • what is the hidden query or queries that the server-side script uses to fetch the requested data from the databases,
  • what is the database answer that was received by the server-side script,
  • how the database response influences the client-side version of the website.
After this preparatory phase of information gathering, the attacker may try to perform a SQLi exploitation. Based on what she has learned, the attacker may identify a specific SQLi scenario; five archetypical vulnerability exploitation scenarios are:
  • Stack-based exploitation
  • Union-based exploitation
  • Boolean-based blind exploitation
  • Error-based exploitation
  • Time-based blind exploitation
Notice that, in addition to these five scenarios, the attacker may also decide that the target server exhibits no vulnerability. The above exploitations differ both in the information that an attacker can collect during the exploratory phase (informative responses, binary outputs, error messages, time information) and the actual form of exploitation (execution of arbitrary code, information disclosure).

2.2.1 Stack-based exploitation

In stack-based exploitation, the attacker receives informative responses from the website (e.g., webpages with different content). An SQLi is performed by closing the hidden query with a semicolon and adding an arbitrary second query. In order to properly close the hidden query and not raise a syntax error, the attacker might need to place the correct string closing sign if the user data is inserted, for instance, between quotation marks. Moreover, this exploitation is only possible if the SQL engine settings allow sending multiple semicolon-separated queries in one line. For these reasons, during the exploratory phase, the attacker relies on the information disclosed by the server in its answers to discover the escape character used in the hidden query. Then she has to verify whether or not semicolon-separated query chains are allowed by the SQL engine settings. For example, assume the original hidden query is:
SELECT original_fields from original_table where some_column=’{input_here}’;
then, upon learning the escape character (’) and the possibility of performing semicolon-separated query chains, the attacker may perform her SQLi by sending the string:
’; select attacker_data from attacker_table;#.
Though this query looks straightforward, finding the relevant table name and column name is not always easy and may take a varying number of queries depending on many factors. If the attacker already knows the correct escape character and that the target is vulnerable to stack-based exploitation, the exploit requires just a single query; missing this information, it may typically be necessary to execute tens of queries for this kind of exploit.
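
To make the substitution concrete, the following minimal Python sketch (our illustration, not the paper's code) shows how a legitimate input and the stack-based payload above end up inside the hidden query; the table and column names are the placeholders from the example.

# Minimal sketch: how the stack-based payload above rewrites the hidden query.
# The table and column names are the placeholders from the example, not real targets.
HIDDEN_QUERY = "SELECT original_fields from original_table where some_column='{input_here}';"

def server_side_query(user_input: str) -> str:
    """Naively substitute the user input into the hidden query (no input validation)."""
    return HIDDEN_QUERY.format(input_here=user_input)

print(server_side_query("alice"))
# SELECT original_fields from original_table where some_column='alice';

payload = "'; select attacker_data from attacker_table;#"
print(server_side_query(payload))
# SELECT original_fields from original_table where some_column='';
#   select attacker_data from attacker_table;#';
# The semicolon closes the hidden query, the second query runs, and # comments out the rest.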

2.2.2 Union-based exploitation

In union-based exploitation, the attacker again receives informative responses from the website, but she now performs the SQLi by concatenating two queries with the union keyword. As in the case of stack-based exploitation, the attacker first needs to find the right escape character; then she has to use the correct column syntax, that is, she must match the number of columns and the data types of the columns in the hidden query. Last, she must verify whether the union statement is enabled in the SQL settings. For example, assume the original hidden query is:
SELECT c1,c2,c3 from original_table where some_column =’{input_here}’;
then, upon learning the escape character (’), the presence of three columns with their data types, and the possibility of performing union-based queries, the attacker may perform her SQLi by sending the string:
’ union select attacker_data1, attacker_data2, attacker_data3 from attacker_table;#.
As with stack-based vulnerabilities, the exact attack depends on what the attacker wishes to achieve. Union-based exploitation has a higher complexity than stack-based exploitation, but it can still be achieved with a number of queries on the order of tens.

2.2.3 Boolean-based blind exploitation

In Boolean-based blind exploitation, the attacker is forced to reconstruct a possible SQLi relying only on binary responses from the target server, typically in the form of two distinct web pages. The attacker can evaluate these responses as truth-value answers to her requests. The attacker can map the database by relying on logic operators and performing a series of requests. In order to carry out this protocol, the attacker must find out during the exploratory phase the escape character of the hidden query, realize the underlying logical expression in the hidden query, and figure out which webpage maps to true and which maps to false. For example, assume the original hidden query is:
SELECT original_fields from original_table where some_column=’{input_here}’;
then, upon learning the escape character (’) and assessing the plain logic of the hidden query, the attacker may start her probing with a query such as:
’ or ASCII(Substr((SELECT @@VERSION),1,1))<64;#.
The above example would not be enough to find any flag or conduct a full exploit, but it marks the beginning of one. Boolean-based blind exploitation is considerably heavier than union- and stack-based exploitation, as the attacker only gets Boolean feedback. This kind of exploit, almost oblivious of the target vulnerability, therefore requires more queries; experts typically use 10 to 20 times as many queries for Boolean-based blind exploitation as for union- or stack-based exploitation. For this reason, many of these exploits are automated with tools such as SQLmap.
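
To illustrate why this archetype is so much heavier, the sketch below (our illustration, assuming a hypothetical boolean_oracle function that sends a payload and classifies the returned page as true or false) reconstructs one character of the SQL version by binary search on its ASCII code.

# Sketch: recover one character of SELECT @@VERSION through a Boolean oracle.
# boolean_oracle(payload) is a hypothetical helper that sends the request and
# returns True if the "true" page comes back, False otherwise.
def extract_char(position: int, boolean_oracle) -> str:
    low, high = 0, 127
    while low < high:
        mid = (low + high) // 2
        payload = f"' or ASCII(Substr((SELECT @@VERSION),{position},1))<={mid};#"
        if boolean_oracle(payload):
            high = mid          # the character code is <= mid
        else:
            low = mid + 1       # the character code is > mid
    return chr(low)

# Each character costs about 7 requests (log2 of 128), which is why experts need
# an order of magnitude more queries than for union- or stack-based exploitation.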

2.2.4 Error-based exploitation

In error-based exploitation, the attacker can observe the SQL engine’s error messages. Parsing the SQL error, the attacker can find table names, column names, and table information. To extract this information, the attacker must learn during the exploratory phase how the original hidden query is used to fetch data using the user input and extract information concerning the error message from the webpage returned by the server. These kinds of exploits depend heavily on the error messages received but are typically very fast and are similar to stack-based exploits in the number of queries experts use.

2.2.5 Time-based blind exploitation

In time-based blind exploitation, the attacker does not learn anything from the HTML page itself; instead, she has to rely on the response time. A longer response time may correspond to a valid query that spends time accessing the database, while a shorter response time may indicate that either the database was not accessed or the relevant query data was limited. Like Boolean-based blind exploitation, time-based blind exploitation is based on binary information, but it uses response time and latency to evaluate the answers as true or false. It also has to handle the uncertainty in the response time: due to network traffic, a slow response does not necessarily correspond to a true query. For this reason, individual requests are typically sent multiple times to compensate for this uncertainty. The relatively low information leak, paired with the unreliability of the response time under traffic, makes this the heaviest exploit, and it is currently carried out with the help of tools such as SQLmap. This exploit is similar to Boolean-based blind exploitation; however, since the same query needs to be sent multiple times, the required number of queries may be up to threefold that of Boolean-based blind exploitation.
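
A minimal sketch of how repetition can compensate for traffic noise is shown below; send_request, the injected SLEEP(1), and the 0.5 s threshold are illustrative assumptions, not details taken from the paper.

import time
import statistics

# Sketch: classify a time-based blind probe as true/false from the median response time.
# send_request(payload) is a hypothetical helper that issues the HTTP request.
def timed_oracle(payload: str, send_request, repeats: int = 3, threshold: float = 0.5) -> bool:
    durations = []
    for _ in range(repeats):                 # repeat to smooth out traffic-induced delays
        start = time.monotonic()
        send_request(payload)
        durations.append(time.monotonic() - start)
    return statistics.median(durations) > threshold

# Example probe: the true branch triggers SLEEP(1), the false branch returns immediately.
probe = "' or IF(ASCII(Substr((SELECT @@VERSION),1,1))<64, SLEEP(1), 0);#"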

2.3 Reinforcement learning

RL is a ML method for solving optimization problems in dynamic environments. A RL problem is expressed as an agent acting in an environment [8]. The goal for this agent is to learn an optimal policy that maximizes its goals or its satisfaction. The agent can take actions in the environment, observe changes, and collect rewards. By inferring the relationship between its actions and the reward amount, the agent reinforces those behaviors that led to high rewards.
Formally, a RL problem is defined by a tuple \(({\mathcal {S}},{\mathcal {A}},{T},{R})\) constituting a Markov Decision Process (MDP), where:
  • \({\mathcal {S}}\) is a set of states for the environment;
  • \({\mathcal {A}}\) is a set of actions that the agent can take;
  • \(T:{\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathcal {S}}\) is a transition function describing how the environment evolves from one state to another after the agent takes an action;
  • \(R:{\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathbb {R}}\) is a reward function that quantifies the goodness of taking an action in a given state.
In this setup, the objective of the agent is to find the action policy \(\pi \) that maximizes the sum of its rewards in the long term, that is:
$$\begin{aligned} \arg \max _{\pi } \sum _{t=0}^T \gamma ^t E[r_t], \end{aligned}$$
where t is a time-index, T is the time horizon of the problem, \(\gamma \) is a discount factor that favours immediate rewards over rewards far in the future, \(r_t\) is the reward obtained at step t, and \(E[\cdot ]\) is the expected value [9].
Practically, an agent interacts with an environment for a number of episodes e, each one consisting of T steps; during these interactions, it learns and finetunes its policy \(\pi \) using a RL algorithm.
A well-known RL algorithm to train an agent is Q-learning; this algorithm joins conceptual simplicity with good performance. In its tabular version, the Q-learning algorithm requires the instantiation of a matrix tracking the value of each action in each state. The agent updates this matrix by changing the value of the current state (\(S_t\)) as it is about to enter a new state (\(S_{t+1}\)), based on the action chosen (\(A_t\)) and the reward received (\(R_t\)). Formally, this can be expressed as:
$$\begin{aligned} Q(S_t, A_t) \longleftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max _{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right] , \end{aligned}$$
(1)
where Q is the action-value matrix and \(\alpha \) is the learning rate that regulates the magnitude of the updates [9]. By looking up its Q-table, the agent can easily decide the best action in a given state. During learning, however, the agent has to balance between exploitation (selecting an action currently considered the best) and exploration (trying out a random action to assess its consequences). This trade-off can be heuristically solved by introducing an exploration rate parameter, \(0 \le \epsilon \le 1\), which, at each step, gives the probability for the agent of selecting a random action. This allows an agent to explore more and lessens the chance of being stuck in a local optimum. In large or infinite state spaces, a Q-learning agent cannot explore the whole space; in this case, it aims to find an acceptable sub-optimal policy that allows it to achieve satisfactory results.
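
The update of Eq. (1) combined with ε-greedy action selection can be written compactly as follows; this is a generic sketch with placeholder hyperparameters, not the exact implementation used in the paper.

import random
from collections import defaultdict

# Sketch of tabular Q-learning with an epsilon-greedy policy (Eq. 1).
# States are assumed to be hashable; actions are integer indices.
class QLearner:
    def __init__(self, n_actions: int, alpha: float = 0.1, gamma: float = 0.9, epsilon: float = 0.1):
        self.q = defaultdict(lambda: [0.0] * n_actions)   # Q-table rows created lazily
        self.n_actions, self.alpha, self.gamma, self.epsilon = n_actions, alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:                 # explore with probability epsilon
            return random.randrange(self.n_actions)
        row = self.q[state]                                # otherwise exploit the current estimate
        return max(range(self.n_actions), key=row.__getitem__)

    def update(self, state, action, reward, next_state):
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])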

2.4 Related work

SQLi is a well-documented vulnerability [1, 2], and several automatic tools have been developed to help with prevention. Web vulnerability scanners focus on detecting SQLi vulnerabilities relying on predefined requests; tools such as Acunetix [10] typically send out a few requests for each web input parameter and observe the response to hypothesize whether a vulnerability is present. Vulnerability scanners only perform detection; they are not designed to exploit the vulnerability. This makes vulnerability scanners prone to false positives, as their hypotheses cannot be verified through an actual attempt at exploitation. Exploitation tools, instead, aim at performing exploitations; SQLi-specific tools, such as SQLmap, sqlninja, or pangolin, can exploit the vulnerability using a very well-defined attack logic. These tools are also not perfect because they are bound to a pre-encoded logic, and they can produce many false negatives requiring the review of a human expert. State-of-the-art tools for SQLi thus still require the supervision of an experienced human pentester. Taking the human expert out of the loop would require a solution that still uses predefined requests (as the current tools do) but can also improve its performance by learning from previous cases and occasionally exploring new possibilities, as human testers do.
Our approach to solving the SQLi problem using ML can be placed in the larger field of ML for cyber security. ML for cyber security has recently been used primarily for defensive operations and less for offensive ones [11, 12]; similar to our work, some researchers have considered RL for defensive operations [13, 14]. Our research, however, focuses on a subgroup of RL for offensive operations, similar to the works [6, 7, 15–17]. In [15], Bland et al. deploy RL agents in an environment modeled with the extended Petri net formalism put forward by [18]; RL agents take on the role of either attacker or defender. Zennaro and Erdődi [16] exemplify the offensive cyber capabilities of RL agents for pentesting. Ghanem and Chen [17] reintroduce and extend their previously introduced IAPTS system [19], which allows RL agents to learn from real-world data with the aid of expert penetration testers in order to speed up and improve manual penetration testing; they show promising results in their experiments. Our work on SQLi addresses a specific subset of penetration testing challenges; however, this is not the first work to use RL for SQLi. Erdődi et al. successfully applied RL to a simplified union-based vulnerability [6]; they did so with a highly structured approach. A year later, Del Verme et al. proposed a less structured approach with the potential of finding unanticipated SQL vulnerabilities [7]. However, the unstructured approach appears computationally expensive; this is part of the motivation behind the adoption of a more structured approach in this work. This work exploits different SQL injection vulnerabilities, while [15] focuses on Cross Site Scripting and Phishing. [16–19] present a general approach for penetration testing with RL without focusing on any unique exploitation or HTTP-level requests. Other works with RL and SQLi, like [7], take a different approach by creating the SQL commands without any prior SQL knowledge, while [6] exploits only union-based vulnerabilities. This research takes RL-based SQL injection exploitation a significant step forward by targeting all SQLi vulnerability archetypes and building the attack from simple web request actions.

2.5 Ethical considerations

Like any penetration testing or ethical hacking tool, an autonomous SQLi agent that identifies and exploits vulnerabilities bears the risk of misuse. The authors want to remark that this work is aimed at testing and improving the security of potentially vulnerable websites and condemn the agent’s use in illegal or unethical activities.

3 Modeling

This section explains how we bring together the SQLi problem with RL through the CTF setting. As discussed, a CTF challenge entails a pentester interacting with the environment, exploring a target, inferring potential vulnerabilities, and finally gaining a flag through exploitation. We follow [16] in modeling this scenario as an MDP, which can be solved via RL. In our case, the CTF setting with a random SQLi vulnerability (or no vulnerability) provides an environment where an agent can take actions, observe effects and transitions, collect rewards, and infer optimal strategies for performing exploits. In the following, we describe in detail how we designed the environment, the set of actions that an agent can take, and the set of states and their evolution. Finally, we present our actual learning algorithm, and we illustrate how all these details come together in the three simulations we have designed.

3.1 Environment

We implemented a dynamic mock-up environment in which, at each instantiation, a server with zero or one SQLi vulnerability is initialized. The environment receives the RL agent’s requests, processes them, and returns a response. The learning agent is expected to interact with the server, detect what sort of vulnerability is present, and then provide the information necessary to start an automated exploit. Each type of SQLi vulnerability has a different success condition.

3.1.1 Stack-based vulnerability

In the stack-based vulnerability, the attacker obtains a flag by performing an SQLi where its query is chained to the hidden query. To succeed, the agent must discover the escape character and claim the presence of a stack-based vulnerability. In our simulation, we account for the three most common escape characters.

3.1.2 Union-based vulnerability

In the union-based vulnerability, the attacker obtains a flag by performing an SQLi where the hidden query is extended via a UNION operation. To succeed, the agent must find the escape character and infer the number of columns in the original hidden query. In our simulation, we account for three different types of escape characters and let the number of columns be between one and three.

3.1.3 Boolean-based blind vulnerability

In the Boolean-based blind vulnerability, the attacker obtains a flag by disclosing and reconstructing information from binary answers provided by the server. To succeed, the agent must find the escape character and provide the right truth-value to the answers it has collected. In our simulation, we account for three different types of escape characters.

3.1.4 Error-based vulnerability

In the error-based vulnerability, the attacker obtains a flag by reconstructing information from the error messages generated by the server. To succeed, the agent has to determine that an error-based vulnerability is present and return the escape character that may be used to cause errors. In our simulation, we account for three different types of escape characters.

3.1.5 Time-based blind vulnerability

In the time-based blind vulnerability, the attacker obtains a flag by disclosing and reconstructing information by measuring the time required by the server to produce an answer. To succeed, the agent has to provide the escape character and state whether the hidden query is true or false.

3.1.6 No vulnerability

Sometimes a website or challenge has no vulnerability; in this rare case, the best action is to recognize the absence of a vulnerability. The agent may declare that the current challenge has no vulnerability at any point in time. No further information has to be provided in this case. The condition of success is tied to the actual absence of a vulnerability.

3.1.7 Preprocessing of the server response

The exchange of information between the server and the agent is mediated through messages encoding the agent’s actions or the server’s responses. Previous works relied on an external hard-coded preprocessing interface that, in practice, would have to be maintained by an expert. This interface reduced the messages of the server to numeric values that could be easily processed by the agent [6, 7]; in the extreme case, the messages of the server may be mapped to a single bit denoting the success or failure of an exploitation attempt.
Here, we drop part of this assumption. The server produces an HTML response r, and the environment forwards to the agent a message m, which is an automatically computed function f of the HTML response. Our function f is defined as follows:
$$\begin{aligned} m=f(r)={\left\{ \begin{array}{ll} -1 &{} \text {if } r \text { is the flag}\\ -2 &{} \text {if } r \text { contains the SQL version}\\ -3 &{} \text {if } r \text { contains an observable error}\\ len(r) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
In general, m is simply the length of the server response, \(m=len(r)\). This heuristic can be fully automated and is justified by the fact that incorrect requests typically give the same generic result in the form of an empty page or an error message. There are many ways to do something wrong, but only a few ways to do something right. An agent can then infer the relevance of an answer by contrasting its length with the length of other responses in a way that is purely syntactical and completely agnostic of the content and its complexity. This approach allows us to rule out the requirement of having an expert mapping the response to a numeric code, as it happens in [6, 7]. Note that while the length is used here, almost any other hash that allows the agent to capture the difference in responses is also viable. The length, in this case, was chosen to emphasize how simple the preprocessing can be and allow for readability. Also, this may enable our agent to make a more refined discrimination than in previous work [6] in which the agent could only distinguish between a positive result and a negative result. To exemplify this distinction consider, for instance, the hidden query shown below providing a simple login query; the values of name and password provided by the agent would be placed into {0} and {1}, respectively.
SELECT secret FROM users
WHERE name="{0}" and password = "{1}" or name="example"
Table 1 shows an example of four different exploratory SQLi actions, and the respective answers r with their lengths \(m=f(r)\). While legitimate and incorrect queries return limited or no data, the correct SQLi causes the information disclosure of a large amount of data. This example also shows that our agent can receive more information than a simple binary answer, as in [6, 7]; this additional information carried by length could be important for more sophisticated scenarios.
Table 1  Examples of SQLi attempts against the sample hidden query and server responses

Input for {0}   | HTML response                | Length
’ and 1=1#      | Example secret data          | 19
" and 1=1#      | The attackers personal data  | 27
" and 1=2#      |                              | 1
" or 1=1#       | All secret data              | 752

Notice that the agent input is meant to be inserted in {0}, with the rest of the hidden query being disabled by the comment symbol, #
Our function f accounts for a few side cases encoded by negative values. If the response contains the objective flag, the agent is notified that it succeeded using \(m=-1\). In case the response contains the SQL server version, the agent receives \(m=-2\); this response is relevant for confirming when a specific exploit is possible. For an expert, the SQL version number would give hints on how to conduct the exploit; also, note that not all vulnerability types leak the version number directly. For our agent, however, it is only used as proof that the agent has found the vulnerability and can start the exploit. If an error observable by the user is present, the agent obtains \(m=-3\); this information is critical when dealing with possible error-based exploitation. Finally, when dealing with time-based blind exploitation, the agent also receives a time token in the form of a Boolean value, where true signals a long time and false a short time.
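
The preprocessing can be summarized in a few lines of Python; the flag value, the version string, and the error keyword below are illustrative assumptions consistent with the description above (the version string is the example given later in Sect. 3.5).

# Sketch of the preprocessing function f mapping an HTML response r to a message m.
# FLAG, the version string, and the "error" keyword are illustrative placeholders.
FLAG = "flag{example}"
SQL_VERSION = "8.0.21-0ubuntu0.20.04.4"

def f(response: str) -> int:
    if response == FLAG:
        return -1                    # the response is the flag: exploitation succeeded
    if SQL_VERSION in response:
        return -2                    # the response leaks the SQL server version
    if "error" in response.lower():
        return -3                    # a user-observable error message is present
    return len(response)             # otherwise: the length of the HTML response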

3.2 Action space

Our action space consists of 65 possible discrete actions divided into 38 predefined queries, one action to declare the absence of any SQLi vulnerability, and 26 exploit actions consisting of multiple queries. Each action serves a specific purpose, as explained below (the exact definition of all the actions is available in “Appendix A”):
  • Actions 1–12 encode exploratory queries that can be used to detect the presence of vulnerabilities and the possible presence of one of the three escape characters in the hidden query. These are actions that human experts often use to get a baseline of how the website works.
  • Actions 13–18 are queries designed to deal with stack-based vulnerabilities. The agent can use them to probe for the SQL version and attempt a stack-based SQLi.
  • Actions 19–36 are queries related to union-based vulnerabilities. These actions allow the agent to deduce the correct number of columns in the hidden query and exploit union-based vulnerabilities.
  • Actions 37–54 are queries concerned with Boolean-based blind vulnerabilities; since the information leak is small, a larger set of actions is needed. Using these queries, the agent can probe whether the ASCII encoding of the first character of the SQL version number is smaller or greater than 64. The choice of 64 as the initial probe is commonplace for ethical hacking experts since it is likely to give the most significant hint as to the first character in the SQL version. This query is, of course, followed by more probing queries if the attacker aims to find the full SQL version.
  • Action 55 is used by the agent to state the absence of any SQL vulnerability.
  • Actions 56–59 are introduced from Simulation2 onward to deal with error-based vulnerabilities. They consist of two probing queries to see if it is possible to force an error, and two exploit actions, as the empty escape has no specific error-based exploit. Notice that the number of these actions is not a multiple of three because the empty escape throws an error with either of the two other probing queries and therefore does not require a separate probing query or a separate exploit action.
  • Actions 60–65 are relevant only in Simulation3 to handle time-based blind vulnerabilities and exploit them.
The exploit actions consist of many queries. The number of queries depends on the attacker’s goal and which type of vulnerability is being exploited. By choosing this action, the agent has determined the vulnerability and has, at this point, all the information necessary to carry out the attack. However, for computational and practical reasons, we do not require the completion of the attack. An attacker may cause an information disclosure, modify one or multiple data in the database, or even, in more advanced cases, try to execute OS commands; our agent stops short of performing the attack.
Importantly, all the actions above have no pre-defined meaning for the agent. Even though we partition them into well-defined subsets, all actions are identical syntactical constructs for the agent. Through interaction and trial-and-error, the agent infers each action’s relevance and builds a strategy to tackle the different possible challenges it has to confront.

3.3 State and transitions

The state is used to represent what the agent knows, while state transitions capture the evolution of this knowledge. We therefore let the state be a set of sets, where each set contains all the actions that resulted in the same message m. Most sets thus contain the actions whose HTML responses had the same length. For example, the state ((2,), (3, 5, 6)) means that actions 3, 5, and 6 resulted in the same length, while action 2 resulted in a different length. Since some responses are unique, such as \(-2\) and \(-3\), we keep special sets for the actions associated with them, marked by the corresponding message value. The example above would then read as ((\(-3\),), (\(-2\),), (2,), (3, 5, 6)), meaning that no action has yet produced a \(-3\) or \(-2\) response, actions 3, 5, and 6 resulted in the same length, and action 2 in a different one.
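
As a hedged illustration, this bookkeeping can be sketched as follows; the tuple-of-tuples encoding and the seeding of the special sets with their marker values are our assumptions, chosen so that the output reproduces the state notation used here and in the tables of Sect. 4.

# Sketch: group actions by the message m they produced. The -3 and -2 sets are always
# present and keep their marker value; the sorted tuple-of-tuples encoding is an
# assumption that makes the state hashable, e.g. usable as a Q-table key.
def build_state(observations: dict) -> tuple:
    """observations maps action number -> message m returned by the environment."""
    groups = {-3: [-3], -2: [-2]}                  # special sets seeded with their marker
    for action, m in observations.items():
        groups.setdefault(m, []).append(action)
    return tuple(tuple(sorted(actions)) for _, actions in sorted(groups.items()))

# Example: actions 3, 5, and 6 returned the same length, action 2 a different one.
state = build_state({2: 120, 3: 425, 5: 425, 6: 425})
# ((-3,), (-2,), (2,), (3, 5, 6))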

3.4 Tabular Q-learning

In these proof-of-concept simulations, we use tabular Q-learning. Tabular Q-learning was chosen over alternative reinforcement learning algorithms, such as PPO [20] and DQN [21], because of its simplicity and interpretability. Interpretability is particularly important in our simulations as it allows us to analyze, monitor and explain the agent, both once it is trained and during training. This allowed for a more in-depth analysis of how the agent simultaneously tackles the different SQL injection vulnerability exploits. This technique could be extended to work with other reinforcement learning algorithms, but this is outside the scope of this work.

3.5 Simulation1

Simulation1 considers the first three vulnerability types: stack-based, union-based and Boolean-based blind, along with the possibility of interacting with a server with no vulnerability. In this simulation, error and time information are irrelevant; only the HTML page length and SQL version (\(m=-2\)) are relevant. All environment pre-processing boils down to identifying a pre-defined string (e.g., 8.0.21-0ubuntu0.20.04.4) or computing the length of the server HTML response. Finding the version number gives our agent definitive proof that it can access secret information. Therefore, it confirms that it has found the vulnerability before it executes its final action, which can be very costly as it consists of multiple queries. For an expert, the SQL version also tells her which exploits may be infeasible; the SQL version may protect against some vulnerabilities.
This simulation represents a small extension of the existing state of the art. Instead of considering a single type of SQLi vulnerability, the agent is presented with three different vulnerabilities. These vulnerabilities share some similarities, and we expect the RL agent to be able to exploit these commonalities in learning and to effectively exploit stack-, union-, and Boolean-based blind vulnerabilities. As a baseline, we would expect the agent to use around four queries: two to find the correct escape out of three possible options, one to distinguish the right vulnerability, and another to declare the start of the exploitation. Following the sequential approach commonly used by pentesters, we also expect the agent to first master identifying the absence of a vulnerability, as this is simpler than exploiting one. Next, we expect it to exploit stack-based vulnerabilities, as it only needs to find the escape; then Boolean-based blind, as it needs the escape and the Boolean value of the hidden query; and finally union-based vulnerabilities, as it needs to find the escape and the number of columns.

3.6 Simulation2

Simulation2 adds to Simulation1 error-based vulnerabilities, thus bringing the number of possible scenarios the agent faces to five (four vulnerabilities plus absence of vulnerability). Our environment now performs more preprocessing to capture errors displayed to the user and produce the message \(m=-3\); practically, this requires the pre-processor to capture error pages. For simplicity, we assume the target website not to contain the word error, and so we reduce the preprocessing only to detecting the presence of the word error.
Adding a new vulnerability increases the challenge for the agent. However, in our encoding, error-based vulnerabilities are relatively easy to identify and exploit, so we do not expect a significant change in the learning dynamics or a large increase in the number of states explored by the agent. Actually, after successfully learning, we may expect the average number of queries to decrease since an error-based vulnerability would require only two steps to exploit. In a sequential approach, we expect the agent to get a handle on error-based vulnerabilities and no vulnerability cases first, then follow the same pattern as in Simulation1: solving stack-based, Boolean-based blind, and finally union-based vulnerabilities.

3.7 Simulation3

Finally, in Simulation3, we include all vulnerabilities. Adding time-based blind vulnerabilities requires the environment to return the time information along with the message m. For simplicity, response time is divided into a long and a short time, separated by an arbitrary threshold. In general, when the query is true, the server will use more time to access the relevant archives, and the environment will generate a long-time message; if the query is false, the response will come quickly and generate a short-time message. However, the time message is not deterministic because of network traffic: with a certain probability p, high network traffic will change a short-time message into a long-time message. We experiment with a deterministic no-traffic network (\(p=0\)) and a stochastic low-traffic network (\(p=0.05\)). The presence of traffic introduces uncertainty, making it reasonable for the agent to retry its queries from time to time. However, given low traffic and the availability of multiple queries that can be used to assess the target, we find that the agent may not need to store duplicate query results. Although the probability of traffic affecting a single query is low, the likelihood of an episode being affected by a high-traffic event increases with the number of queries. The relationship between the episode traffic probability, \(P(e_t)\), and the query traffic probability, \(P(q_t)\), is given by:
$$\begin{aligned} P(e_t) = 1 - (1- P(q_t))^{(n-1)}, \end{aligned}$$
(2)
where n is the number of queries in the episode; notice that in our computation we consider \((n-1)\) assuming the last query to be the exploitation, after which we are not concerned with network traffic anymore. This means that episodes that end in 1, 2, 10, and 20 queries have an episode traffic probability of 0%, 5%, 37%, and 62.3%, respectively.
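
Plugging numbers into Eq. (2) reproduces the probabilities quoted above; a short check with P(q_t) = 0.05:

# Check of Eq. (2) with P(q_t) = 0.05 for episodes of 1, 2, 10, and 20 queries.
p_q = 0.05
for n in (1, 2, 10, 20):
    p_e = 1 - (1 - p_q) ** (n - 1)
    print(f"n = {n:2d}: P(e_t) = {p_e:.3f}")   # 0.000, 0.050, 0.370, 0.623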
The addition of time-based blind vulnerabilities, especially in the presence of traffic, significantly increases the complexity. They represent a vulnerability having fewer commonalities with the previous ones, thus limiting the amount of information the agent can transfer during learning. We expect the need for more extensive training and a modest increase in the number of states explored during training. In the absence of traffic, the agent could solve the time-based blind vulnerability similarly to Boolean-based blind problems, as both problems provide binary information, either as a simple true or false or as a long time and short time. With traffic, however, time-based blind vulnerabilities will require more time to solve.

4 Results and discussion

In this section, we will first look at and discuss the training of the agents during the three simulations. Then we will evaluate the resulting agents and provide insights into the different strategies they employ to solve the CTF problems.

4.1 Training

We train the agents in our dynamic SQL environment for a million episodes, except for Simulation3, where we increased it to ten million. The first 90% of the episodes adopt an exploration rate of 0.1, except for Simulation2, which has an initial exploration rate of 0.02, while the last 10% use an exploration rate of 0. An episode terminates if the agent executes the correct exploit action or claims that there is no vulnerability; alternatively, an episode terminates when reaching a maximum number of queries, which we set to 100. Notice that this number is larger than the set of available actions. This limit is meaningful only at the beginning, when the agent will randomly explore all the possible actions and observe their results. By the end, however, this limit will be irrelevant as the agent will have learned a strategy to achieve its goal in a few steps. We let the reward of a single query be \(-1\); the flag returns a value worth 1000, and an incorrect solution attempt, including deciding to give up when there is a solution, produces a reward of \(-1000\).
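
For reference, the training regime described above can be condensed into a small configuration sketch; the dictionary keys are our own naming, not the authors' code.

# Summary of the training settings described above (names are illustrative).
TRAINING = {
    "episodes": 1_000_000,                 # 10_000_000 for Simulation3
    "max_queries_per_episode": 100,
    "epsilon_schedule": {
        "first_90_percent": 0.1,           # 0.02 for Simulation2
        "last_10_percent": 0.0,
    },
    "rewards": {"query": -1, "flag": 1000, "wrong_claim": -1000},
}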

4.1.1 Simulation1

We run Simulation1 with the above settings and name the resulting trained agent agent1. Figure 1a shows the evolution of the number of states and cumulative successes during training. Agent1’s learning may be divided into three phases: exploratory, refinement, and exploitation. The exploratory phase lasts from the start to around 1.3 \(\times 10^5\) episodes, during which almost every state is novel, and most exploit attempts fail. In the following refinement phase, from roughly 1.3 \(\times 10^5\) to 9 \(\times 10^5\) episodes, agent1 reliably solves the problem while still exploring to see if it can find better solutions. The final phase, the exploitation phase, is the artificial phase created by setting the exploration rate to 0 for the last 10% of episodes; this phase will be discussed in the next section. The transition between the exploratory and refinement phases is of particular interest; this sharp change is likely because once the agent can reliably exploit one vulnerability, the others closely follow.
Figure 1b offers a breakdown of the successes of agent1 on the different challenges, which helps illuminate this transition. Indeed, around 1 \(\times 10^5\) episodes, agent1 learns to solve all the challenges. However, while global learning is monotonic, the learning dynamics for individual challenges are not; in multiple simulations, agent1 improved on a specific challenge (e.g., stack-based), only for its performance to plunge later. This oscillating behavior may be due to the agent exploring a complex solution space and learning to negotiate between different challenges without favoring one in particular. Nevertheless, the simultaneous growth of the remaining curves highlights that, overall, the agent is able to learn how to deal with all the challenges at the same time and possibly exploit commonalities among them. Another interesting feature of these curves is their step-like behavior: this likely corresponds to the agent progressively learning how to perform an exploit in specific subversions of the problem. In our simulation, we have three distinct escape characters (’, " and \(\epsilon \)), and learning to deal with each may correspond to a specific step in the learning curve. The exact order in which agent1 learns the exploits depends on the random order in which the problems are presented during a simulation; however, a general trend is for stack-based vulnerabilities and no vulnerability to be solved first, with union-based or Boolean-based blind vulnerabilities last. This result agrees with our hypothesis. However, it was unforeseen that, in some cases, the agent learns to solve stack-based vulnerabilities before it gets the hang of the no-vulnerability cases; the order in which the agent solves the first two challenges was more fluid than initially thought. Similarly, the order in which the agent learns to deal with Boolean-based blind and union-based exploits was more training-dependent than initially hypothesized.

4.1.2 Simulation2

For Simulation2, we again trained an agent using the same settings as in Simulation1, except with an initial exploration rate of 0.02, on an environment that could also contain error-based vulnerabilities, and named it agent2. Figure 2a shows the number of states and cumulative reward during training. In contrast to Simulation1, Simulation2 shows a smoother transition between the exploratory and refinement phases. Noticeably, agent2 explores fewer states and achieves more successes than in Simulation1; as discussed, this is because error-based vulnerabilities are easier for the agent to deal with in our implementation. This observation is confirmed in the breakdown of successes in Fig. 2b, where we can see that agent2 learns almost immediately how to solve error-based vulnerabilities. The overall dynamics for the other vulnerabilities are analogous to what we observed for agent1, with a more accentuated staircase-like trajectory.

4.1.3 Simulation3

In Simulation3, we include time-based blind vulnerabilities and scale the number of training episodes to 10 million. We train two agents, agent3 and agent3t, respectively trained in an environment with no traffic and in an environment with 5% traffic.
No traffic
In Fig. 3a we show the number of states and the number of cumulative successes for agent3 for the first million episodes to allow an easier comparison with the previous agents. The plot highlights how long agent3 keeps learning, even beyond the usual transition from the exploratory regime to the refinement regime; indeed, agent3 keeps encountering more states at a frequency significantly higher than previous agents. After a million episodes, agent3 has observed \(8.2 \times 10^5\) states, almost an order of magnitude more than agent1 or agent2. The previous training horizon of one million episodes is barely sufficient for learning a good strategy.
Figure 3b shows the success rate on the different vulnerabilities, again during the first million episodes. The dynamics are consistent with previous simulations, although noisier. Easy vulnerabilities, like error-based or the absence of vulnerability, are solved quickly, as these always require little information and represent a good bet for the agent. The evolution of the time-based blind curve is particularly noisy; this is likely because, in the time-based blind vulnerability, it is not possible to get concrete confirmation of the vulnerability, as we can by probing the SQL version or triggering an error.
Traffic of 5 percent
Like before, Fig. 4a shows the number of states and number of successes of agent3t in the first million episodes of its training. The introduction of traffic and uncertainty has an immediate effect on learning. Agent3t has no way to deal explicitly with uncertainty, so it is forced to consider results affected by traffic as separate states. This translates into an explosion of states: by the end of the first million episodes, agent3t explored \(2.3 \times 10^6\) states. The transition between the exploratory and refinement phases is still present but happens after more episodes than in other simulations.
The effect of traffic is evident in the breakdown in Fig. 4b, where we now plot the first 2 million training episodes, as it takes agent3t more than one million episodes to solve time-based blind vulnerabilities. More than for agent3, all curves are now affected by the noise introduced by traffic. The uncertainty is especially problematic for time-based blind vulnerabilities, where the agent takes longer than for any other vulnerability. Further insight into the learning quality of agent3 and agent3t is offered in Fig. 5a, b, where we plot the last two million episodes; notice that just before the last million episodes we reduce the exploration rate to zero. Given the longer time, agent3 managed to learn an optimal policy that, in the absence of exploration, consistently solves the problem successfully. Agent3t, on the other hand, has not been able to deal completely with the uncertainty due to traffic, and its success rate on time-based blind vulnerabilities stalled around 95%, consistent with the level of traffic the agent was facing.

4.2 Analysis of the agent

We further evaluate the behavior of our agents by testing them on 1000 vulnerabilities, logging how many actions they take to exploit a vulnerability and whether or not the exploitation was successful.

4.2.1 Simulation1

Agent1 successfully exploited all 1000 vulnerabilities it was tested on. The number of actions by vulnerability type is shown as percentages in Fig. 6.
Surprisingly, in 32% of the instances of stack-based vulnerabilities, agent1 was able to capture the flag in just two queries. As shown in Table 2, a basic stack-based exploitation would require two actions to probe the correct escape character, one action to check the actual presence of a stack-based vulnerability using the third available escape character, and then the final action for the actual exploit.
Table 2  Expected trajectory for agent1 to perform stack-based exploitation

Qn | State                | Action                 | m
1  | ((-2))               | " or 1=1#              | 425
2  | ((-2), (2,))         | or 1=1#                | 425
3  | ((-2), (2, 10))      | ’; select @@version;#  | -2
4  | ((-2, 14), (2, 10))  | ’ Exploit stack FINAL  | -1

Qn query number, m the environment response
Instead, agent1 has learned a greedier strategy that allows it to solve certain instances faster by overlapping actions. As shown in Table 3, the first query can be used to check both the escape character and the presence of the stack-based vulnerability; in about 33% of instances, consistent with our experimental results, the agent will guess right and will be able to launch an exploit right after. However, if the escape character is wrong, the agent will receive little useful information, and in order to work out the stack-based exploitation it will then be forced to submit more queries. Indeed, very fast stack-based exploits are balanced by slower ones requiring seven or eight queries, bringing the average number of steps to more than five. Although this may seem sub-optimal, it is worth remarking that the agent is confronted with a number of different vulnerabilities: if the agent were dealing only with stack-based vulnerabilities, we would expect it to learn the four-step strategy discussed above; however, the agent is trying to solve multiple problems at the same time. Table 4 shows a sample of this longer execution to achieve stack-based exploitation. After the initial negative result, the agent uses a further query to try to figure out the right escape character. The third and fourth actions test whether a union-based vulnerability is present, and the fifth looks for a Boolean-based blind vulnerability. Only with the sixth and seventh actions does the agent consider the possibility of stack-based exploitation. While it may feel natural for a human expert to consider vulnerabilities one by one and exhaust them, the agent interleaves different attempts, trying to maximize the joint information it may extract. Therefore, the number of steps to achieve an exploitation depends not only on the specific vulnerability present but also on how the agent switches attention among different possible vulnerabilities.
Table 3  Action trajectory for agent1 to perform stack-based exploitation in 2 steps

Qn | State          | Action                 | An | m
1  | ((-2),)        | ’; select @@version;#  | 14 | -2
2  | ((-2, 14),)    | ’ Exploit stack FINAL  | 15 | -1

Qn query number, An action number, m the environment response
Table 4  Action trajectory of agent1 for solving stack-based vulnerabilities in 7 steps

Qn | State                               | Action                                          | An | m
1  | ((-2),)                             | ’; select @@version;#                           | 14 | 697
2  | ((-2,), (14,))                      | and 1=2#                                        | 9  | 608
3  | ((-2,), (9,), (14,))                | union select (select @@version),2,3#            | 32 | 697
4  | ((-2,), (9,), (14, 32))             | union select (select @@version)#                | 30 | 697
5  | ((-2,), (9,), (14, 30, 32))         | ’ or ASCII(Substr((select @@version),1,1))<64#  | 45 | 697
6  | ((-2,), (9,), (14, 30, 32, 45))     | ; select @@version;#                            | 16 | -2
7  | ((-2, 16), (9,), (14, 30, 32, 45))  | FINAL multi_stack FINAL                         | 17 | -1

Qn query number, An action number, m the environment response
Table 5  Action trajectory of agent1 for solving union-based vulnerabilities in 4 steps

State                     | Action                                | Action nr | Response (m)
((-2),)                   | ’; select @@version;#                 | 14        | len43
((-2,), (14,))            | and 1=2#                              | 9         | len67
((-2,), (9,), (14,))      | union select (select @@version),2,3#  | 32        | SQLversion(-2)
((-2, 32), (9,), (14,))   | Exploit union rows:3 FINAL            | 35        | Flag(-1)
Table 5 shows how agent1 can proceed from a failed stack-based exploitation probing to the exploitation of a union-based vulnerability. Notice the different response lengths after the first and second query: this gives the agent an indication that \(\epsilon \) is the correct escape, and it allows it to go straight to guessing the number of rows; in this example, it guesses on the first trial, thereby reaching the exploit in just four queries.
This analysis of the trajectories of the agents has allowed us to get a better reading of the results in Fig. 6. Agent1 is trying to optimize a joint strategy in which it sends requests useful to collect as much information as possible; it starts betting on a stack-based vulnerability, while keeping the hypothesis of a union-based vulnerability open until the end, and collecting data to evaluate a Boolean-based blind vulnerability. It can deterministically decide in six steps that the target does not expose any of the vulnerabilities it has learned to target.

4.2.2 Simulation2

Of the 1000 vulnerabilities agent2 was tested on, it successfully exploited 999 of them. The vulnerability agent2 failed on was a Boolean-based blind vulnerability with the trajectory shown in Table 6.
Table 6
Action trajectory of agent2 failing to solve a Boolean-based blind vulnerability

Qn | Action | An | m
1 | ’ | 57 | 425
2 | " | 55 | 425
3 | ’; select @@version;# | 14 | 425
4 | " and 1=1# | 0 | 425
5 | ; select @@version;# | 16 | 425
6 | and 1=2# | 9 | 425
7 | ’ or 1=2# | 7 | 425
8 | union select (select @@version),2# | 31 | 425
9 | No vulnerability (termination) | 54 | 192

Qn query number, An action number, m the environment response
From this we see that agent2 probed all three escape characters with queries 4, 6, and 7; however, they all resulted in the same response length. A natural conclusion for the agent would be that there is no vulnerability. However, both in our simulated environment and in real-world cases, it is possible that even with the correct escape the website appears unchanged and returns a response of the same length. Such cases can sometimes trick real-world pentesters as well, although both the expert and the agent could alleviate the problem by probing each escape in multiple ways; in our experiment this can be done by using both “or” and “and” conditions.
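As an illustration of this double-probe idea, the following is a minimal sketch and not part of the agents’ action space; the send_query helper and the comparison against a baseline response length are assumptions introduced for the example:

# Probe each escape with both an AND-false and an OR-true condition.
# send_query() is a hypothetical helper returning the length of the HTML
# response, i.e. the same statistic used as the environment response m.
def probe_escapes(send_query):
    baseline = send_query("")  # length of the unmodified page
    for escape in ['"', "'", ""]:
        len_and = send_query(escape + " and 1=2#")  # should shrink the result set
        len_or = send_query(escape + " or 1=1#")    # should return every row
        # If the two probes disagree with each other or with the baseline,
        # this escape very likely closes the hidden query correctly.
        if len({baseline, len_and, len_or}) > 1:
            return escape
    return None  # no escape produced an observable change

Probing with both conditions reduces (but, as the failure above shows, does not eliminate) the chance that an informative response happens to coincide in length with the baseline.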
Figure 7 shows the distribution of the number of queries used by agent2 to exploit the different vulnerabilities during the 999 successful test episodes.
Like agent1, agent2 tries to maximize the informativeness of the responses it receives in order to collect the information necessary to solve any possible challenge. However, we notice a different ordering in how the problems are solved. The first queries target error-based vulnerabilities, which can be solved quickly in just two actions. After that, agent2 exploits the information already collected to address stack-based vulnerabilities, similarly to agent1. Union-based vulnerabilities are also dealt with in the background, with their distribution forming a bell-shaped curve spanning 7–11 queries. Determining the absence of vulnerabilities now takes nine queries instead of six, while Boolean-based blind vulnerabilities take the longest.
The distribution of the number of queries used by agent2 shows fewer overlaps between different vulnerabilities than that of agent1; the strategy of agent2 may thus more closely resemble the approach of a human expert evaluating the presence of different vulnerabilities one by one. Nevertheless, agent2 is always working in parallel, as information collected while looking for one vulnerability is promptly reused for the next; for instance, once an error-based vulnerability proves absent, the information collected can be immediately exploited in the case of a stack-based vulnerability.

4.2.3 Simulation3

For agent3 and agent3t, we test the agents on 10,000 cases to improve the statistics on the performance for time-based blind vulnerabilities.
No traffic
On these vulnerabilities, agent3 achieves a success rate of 99.98%, with 9998 successes and two failed exploits of Boolean-based blind vulnerabilities. As in Simulation2, these two failures6 are due to the website being vulnerable to Boolean-based blind exploits but appearing not to be, since the response that would normally leak information about the vulnerability has the same length as the response that leaks none. Figure 8 shows the distribution of the number of queries used by agent3.
The introduction of time-based blind vulnerabilities brings limited change to the distribution of the number of queries. Agent3 seems to collect more information at the beginning: error-based vulnerabilities now require a minimum of three queries, while, on the other hand, deciding that no vulnerability is present takes eight queries instead of nine. Stack-based vulnerabilities tend to be solved early on, while union-based, Boolean-based blind, and time-based blind vulnerabilities are interleaved and require multiple queries.
Traffic
Increasing the traffic to 5%, we observe a decrease in the success rate, as reported in Table 7.
Table 7
Success rates for different vulnerability types for agent3t

Vulnerability type | Success rate
Stack-based | 99.88%
Union-based | 99.88%
Boolean-based blind | 99.88%
No vulnerability | 100%
Error-based | 100%
Time-based blind | 95.83%
The uncertainty in the response time makes it challenging for the agent to identify even easy vulnerabilities such as stack-based ones. The most significant difficulty is with time-based blind vulnerabilities, as is also the case for real-life experts.
Table 8
Action trajectory of agent3t failing to solve a time-based blind vulnerability

Qn | Action | Action nr | m
0 | ’ or 1=1# | 6 | (734, slow)
1 | ’ | 57 | (734, slow)
2 | " or 1=2# | 3 | (734, slow)
3 | " | 55 | (734, slow)
4 | ’ union select (select @@version)# | 24 | (734, slow)
5 | or 1=1# | 10 | (734, slow)
6 | ’ and 1=1# | 4 | (734, slow)
7 | and ASCII(Substr((select @@version),1,1))\(>=\)64# | 48 | (734, slow)
8 | " Exploit time hiddenq f FINAL | 59 | (734, slow)

Qn query number, Action nr action number, m the environment response
Table 8 shows one of the failed trajectories. Here, the agent tested all three escapes with queries 0, 2, and 5, and all three responses were slow; this means that at least one response was affected by traffic. Probably more queries were influenced, even though not every slow response implies traffic; most of these slow responses likely stem from the database being accessed. An expert might send these queries multiple times and then compare the response times. However, such behavior is impossible for agent3t: the agent’s state is a set of sets grouping actions and responses, and in its current form the agent has no way of tracking repeated attempts of the same query. This encoding allowed us to restrict the state space, and it worked well for Simulation1 and Simulation2; however, in the presence of traffic it shows limitations that will be addressed in future work.
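To make this state encoding concrete, the following is a minimal sketch reconstructed from the printed trajectory tables (e.g., Table 4); it is not the authors’ implementation, and the seeding of the first group with the marker \(-2\) is inferred from the tables:

# Actions are grouped by the response they produced; the state is the tuple of
# these groups, e.g. ((-2,), (9,), (14, 32)) after three queries in Table 4.
def encode_state(history):
    """history: list of (action_number, response) pairs observed so far."""
    groups = {-2: [-2]}  # the marker -2 seeds the first group, matching the initial state (-2,)
    for action, response in history:
        groups.setdefault(response, []).append(action)
    canon = [tuple(sorted(g)) for g in groups.values()]
    return tuple(sorted(canon))  # sorted tuples give a canonical, hashable Q-table key

Applied to the trajectory of Table 4, this grouping reproduces the printed states; it also makes explicit why a repeated query cannot be tracked: sending the same action twice simply re-inserts the same action number into a group.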
Despite this restriction, the agent achieves an accuracy as high as 96% on time-based blind vulnerabilities. This statistic is not a direct counterpart of the 5% traffic, since traffic occurs with 5% probability per query and most time-based blind exploits require more than one query. Agent3t uses 2–16 actions in its successful exploits, which means that it observes the responses of 1–15 queries, corresponding to an episode-level traffic probability ranging from 5 to 53.7%.
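For reference, the episode-level figure follows directly from the per-query traffic probability: assuming independent 5% traffic per query, the probability that at least one of the \(n\) observed responses is affected is \(1-0.95^{n}\), which gives 5% for \(n=1\) and \(1-0.95^{15}\approx 53.7\%\) for \(n=15\).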

5 General discussion and conclusion

We have shown that a single RL agent can deal with a number of SQLi vulnerabilities in parallel. We relied on an efficient and automatic way of extracting relevant information from an HTML page, and we demonstrated the efficacy of our agent against all five SQLi vulnerability archetypes with consistent success.
In our experiments we observed that the RL agents learned to solve multiple challenges by interleaving vulnerability-specific queries and by accumulating and transferring knowledge. We also noticed a relative ordering in the number of steps required to solve challenges, consistent with the complexity of the vulnerabilities: stack-based and error-based exploits, as well as giving up, were the most straightforward, while time-based blind, Boolean-based blind, and union-based vulnerabilities were the most complex.
We observed that the agents’ learning hardly changed when going from Simulation1 to Simulation2, which added the comparatively straightforward exploit of error-based vulnerabilities. Moreover, perhaps surprisingly, the agent explored fewer states, though this is likely mainly due to the lower exploration rate.
Dealing with the traffic and uncertainty introduced by time-based blind vulnerabilities proved more challenging than the other vulnerability types. Going from Simulation2 to Simulation3, the agent needed significantly more exploration. Time-based blind is a challenging vulnerability for real-life experts too, and it was the only vulnerability on which the agent did not achieve roughly 100% accuracy. These vulnerabilities proved challenging for our Q-learning agent as they introduce a significant amount of uncertainty. To deal with this uncertainty, a state space that better captures this stochasticity could be conceived. Alternatively, a Q-learning agent could first deal with the main static challenges, while a second agent or model could reconstruct the traffic patterns of the network and address time-based blind vulnerabilities if present.
Another improvement concerns the representation of an HTML page: we processed it down to the single statistic of its length, but more sophisticated and informative processing, such as hashing, compressing, or encoding, could provide the agent with richer information.
This work provided a proof of concept of the feasibility of performing SQLi exploitation via RL. The main future challenge is to adapt the RL agent to handle real-world targets. This would require expanding our state and action spaces while retaining enough structure to keep the problem manageable; hierarchical spaces or customizable query strings (using, for instance, wildcards) may be considered. Further, a real-world agent may consider the cost of performing specific queries: this may allow the agent to deal with targets where more than one vulnerability is present and where one exploitation may be more expensive than another.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human or animal participants performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendices

Appendix A: Action space

Here we report the definition of all the actions in the action space, divided into groups.

Actions for detecting the presence of the vulnerability and the escape character

0. " and 1=1#
1. " and 1=2#
2. " or 1=1#
3. " or 1=2#
4. and 1=1#
5. and 1=2#
6. or 1=1#
7. or 1=2#
8. ’ and 1=1#
9. ’ and 1=2#
10. ’ or 1=1#
11. ’ or 1=2#

Actions for verification and exploitation of stack-based queries

12. "; select @@version;#
13. Exploit:" stack FINAL (multiple requests starting with "; to obtain the flag in a stack-based way)
14. ’; select @@version;#
15. Exploit:’ stack FINAL (multiple requests starting with ’; to obtain the flag in a stack-based way)
16. ; select @@version;#
17. Exploit: stack FINAL (multiple requests starting with ; to obtain the flag in a stack-based way)

Actions for verification and exploitation of the union-based queries

18. " union select (select @@version)#
19. " union select (select @@version),2#
20. " union select (select @@version),2,3#
21. Exploit:" union rows:1 FINAL (multiple requests starting with " union select (request goes here) to obtain the flag in a union-based way)
22. Exploit:" union rows:2 FINAL (multiple requests starting with " union select (request goes here),2 to obtain the flag in a union-based way)
23. Exploit:" union rows:3 FINAL (multiple requests starting with " union select (request goes here),2,3 to obtain the flag in a union-based way)
24. ’ union select (select @@version)#
25. ’ union select (select @@version),2#
26. ’ union select (select @@version),2,3#
27. Exploit:’ union rows:1 FINAL (multiple requests starting with ’ union select (request goes here) to obtain the flag in a union-based way)
28. Exploit:’ union rows:2 FINAL (multiple requests starting with ’ union select (request goes here),2 to obtain the flag in a union-based way)
29. Exploit:’ union rows:3 FINAL (multiple requests starting with ’ union select (request goes here),2,3 to obtain the flag in a union-based way)
30. union select (select @@version)#
31. union select (select @@version),2#
32. union select (select @@version),2,3#
33. Exploit: union rows:1 FINAL (multiple requests starting with union select (request goes here) to obtain the flag in a union-based way)
34. Exploit: union rows:2 FINAL (multiple requests starting with union select (request goes here),2 to obtain the flag in a union-based way)
35. Exploit: union rows:3 FINAL (multiple requests starting with union select (request goes here),2,3 to obtain the flag in a union-based way)

Actions for verification and exploitation of Boolean-based blind

36. " and ASCII(Substr((select @@version),1,1))\(>=\)64#
37. " and ASCII(Substr((select @@version),1,1))<64#
38. " or ASCII(Substr((select @@version),1,1))\(>=\)64#
39. " or ASCII(Substr((select @@version),1,1))<64#
40. Exploit:" Booleanblind hq:F FINAL (multiple requests starting with " or to obtain the flag in a Boolean-based blind way; we use or since the hidden query is false)
41. Exploit:" Booleanblind hq:T FINAL (multiple requests starting with " and to obtain the flag in a Boolean-based blind way; we use and since the hidden query is true)
42. ’ and ASCII(Substr((select @@version),1,1))\(>=\)64#
43. ’ and ASCII(Substr((select @@version),1,1))<64#
44. ’ or ASCII(Substr((select @@version),1,1))\(>=\)64#
45. ’ or ASCII(Substr((select @@version),1,1))<64#
46. Exploit:’ Booleanblind hq:F FINAL (multiple requests starting with ’ or to obtain the flag in a Boolean-based blind way; we use or since the hidden query is false)
47. Exploit:’ Booleanblind hq:T FINAL (multiple requests starting with ’ and to obtain the flag in a Boolean-based blind way; we use and since the hidden query is true)
48. and ASCII(Substr((select @@version),1,1))\(>=\)64#
49. and ASCII(Substr((select @@version),1,1))<64#
50. or ASCII(Substr((select @@version),1,1))\(>=\)64#
51. or ASCII(Substr((select @@version),1,1))<64#
52. Exploit: Booleanblind hq:F FINAL (multiple requests starting with or to obtain the flag in a Boolean-based blind way; we use or since the hidden query is false)
53. Exploit: Booleanblind hq:T FINAL (multiple requests starting with and to obtain the flag in a Boolean-based blind way; we use and since the hidden query is true)

Actions for giving up

54. FINAL no vulnerability FINAL

Actions for verification and exploitation of error-based vulnerabilities

55. "
56. Exploit:" error FINAL (multiple requests exploiting the error information)
57. ’
58. Exploit:’ error FINAL (multiple requests exploiting the error information)

Actions for verification and exploitation of time-based blind

59. Exploit:" time hiddenq f FINAL (multiple requests starting with " or to obtain the flag in a time-based blind way; we use or since the hidden query is false)
60. Exploit:" time hiddenq t FINAL (multiple requests starting with " and to obtain the flag in a time-based blind way; we use and since the hidden query is true)
61. Exploit:’ time hiddenq f FINAL (multiple requests starting with ’ or to obtain the flag in a time-based blind way; we use or since the hidden query is false)
62. Exploit:’ time hiddenq t FINAL (multiple requests starting with ’ and to obtain the flag in a time-based blind way; we use and since the hidden query is true)
63. Exploit: time hiddenq f FINAL (multiple requests starting with or to obtain the flag in a time-based blind way; we use or since the hidden query is false)
64. Exploit: time hiddenq t FINAL (multiple requests starting with and to obtain the flag in a time-based blind way; we use and since the hidden query is true)
Notice that all the query actions end with a # to comment out the remaining part of the hidden query. The two notable exceptions are the error-based probe actions: these are meant to trigger an error, so we do not comment out the rest of the hidden query.
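As a concrete illustration of the role of the #, consider the following sketch; the hidden query shown here is a hypothetical example of our own, not one taken from the simulated environment:

# Hypothetical hidden (server-side) query into which the web parameter is inserted.
hidden_query = "SELECT title FROM articles WHERE id = '{}' LIMIT 1"

payload = "' or 1=1#"  # one of the query actions listed above
print(hidden_query.format(payload))
# -> SELECT title FROM articles WHERE id = '' or 1=1#' LIMIT 1
# Everything after # is a MySQL comment, so the dangling ' LIMIT 1 is ignored,
# whereas an error-based probe such as a lone ' deliberately leaves the query malformed.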

Appendix B: Hyperparameter tuning

For the reinforcement learning agent there are a number of parameters that have to be tuned. We discuss how we selected the following hyperparameters:
1. Q-table initialization
2. Learning rate (\(\lambda \))
3. Rewards
4. Discount factor (\(\gamma \))
5. Exploration rate
6. Max step
Further fine-tuning of the hyperparameters, for improving performance or for more realistic scenarios, is left for future work.

Q-table initialization

The Q-table is the primary driver of how the agent chooses actions, setting the agent’s starting preferences and biases. The Q-value of an unseen state is initialized to \(0.99 + rand(0,0.01)\), where rand(0, 0.01) is a random number between 0 and 0.01 drawn from a uniform distribution. This allows for some initial variation while still letting the agent quickly learn to avoid repeating an action if it is unfavorable. For memory reasons, unseen states were only initialized as they were explored, allowing the agent to traverse this massive state space without storing all of it in memory.
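A minimal sketch of such lazy, optimistic initialization is given below; it is our illustration under the stated assumptions, not the authors’ code, and the helper names are invented for the example:

import random
from collections import defaultdict

def make_q_table(n_actions):
    """Lazily initialized Q-table: a row of optimistic values (0.99 plus a small
    random tie-breaker) is created only when a state is visited for the first time."""
    def fresh_row():
        return [0.99 + random.uniform(0, 0.01) for _ in range(n_actions)]
    return defaultdict(fresh_row)

# Example: for a previously unseen state the greedy action is essentially random,
# because all its Q-values lie close to 0.99.
q = make_q_table(n_actions=64)
state = ((-2,),)
best_action = max(range(64), key=lambda a: q[state][a])

Because unexplored actions start near the highest attainable per-step value, the agent is implicitly encouraged to try them once, while negative experience quickly pushes their Q-values down.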

Learning rate

The learning rate determines how much the agent updates its estimates after a single action. This value typically lies between 0 and 1 and often changes during training. We used a fixed learning rate of 0.1 for these experiments, which proved to work well.

Rewards

The rewards are the backbone of the reinforcement learning agent’s training, indicating which behavior is favorable or unfavorable. As we want the agent to take as few actions as possible, we adopted the rewards shown in Table 9.
Table 9
Rewards adopted for training the agents

Response and action | Reward
Normal query (any response) | \(-1\)
Wrong exploit action | \(-1000\)
Flag found | +1000
Give up when a vulnerability exists | \(-1000\)
This 1:1000 ratio ensures that the agent’s primary focus is solving the problem, with minimizing the number of queries as a secondary goal. With a smaller reward-to-punishment ratio, the agent sometimes tried to guess the solution.

Discount factor

The discount factor (\(\gamma \)) gives the weight of future rewards when training the agent. We use a standard discount factor of 0.9.
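For reference, these two hyperparameters enter the standard tabular Q-learning update, which we restate here only to make their interaction explicit: with learning rate \(\lambda = 0.1\) and discount factor \(\gamma = 0.9\), after taking action \(a\) in state \(s\) and observing reward \(r\) and next state \(s'\), the agent updates
\[ Q(s,a) \leftarrow Q(s,a) + \lambda \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]. \]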

Exploration rate

The exploration rate allows the agent to ignore the current state of the Q-table and select a random action, enabling exploration and potentially finding new, unseen solutions. In our experiments, we chose a default exploration rate of 0.1 for the first 90% of the training episodes and 0 for the remaining 10%. This lets the agent explore while finding reasonable candidate solutions, and then solidify its knowledge in the last 10% by perfecting known exploits rather than exploring. For Simulation2, we opted for an initial exploration rate of 0.02. This lower exploration rate was motivated by preliminary runs showing that the agent did not have time to fully converge within one million training episodes; with the lower rate it stabilized faster and tackled the relevant challenges.
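A minimal sketch of this schedule and of \(\epsilon \)-greedy action selection follows; it is illustrative only, and the function names are our own:

import random

def exploration_rate(episode, total_episodes, initial_rate=0.1):
    """0.1 by default (0.02 for Simulation2) during the first 90% of training, 0 afterwards."""
    return initial_rate if episode < 0.9 * total_episodes else 0.0

def choose_action(q_row, epsilon):
    """Epsilon-greedy selection over the Q-values of the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))               # explore
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploit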

Max step

The max step parameter sets an upper bound on the number of queries an agent can send before the episode is terminated. This naturally impacts the number of possible states the agent can explore; for these experiments we did not wish to restrict the agent’s exploration too much and chose a max step of 100. This is a large number, especially given that the agent has at most 64 actions available to it.
Note that for the time-based blind setting with traffic, this upper bound could in principle be reached, as the same query can yield different responses. However, initial experiments did not show this to be an issue.

Appendix C: Additional results

Here we provide some additional results from our experiments.

Simulation1: trajectory for a stack-based exploitation in 8 queries

Table 10
Action trajectory of agent1 for solving stack-based vulnerabilities in 8 steps

Qn | State | Action | An | m
1 | (\(-2\),) | ’; select @@version;# | 14 | 502
2 | ((\(-2\),), (14,)) | and 1=2# | 9 | 502
3 | ((\(-2\),), (9, 14)) | ’ and 1=1# | 4 | 502
4 | ((\(-2\),), (4, 9, 14)) | " or 1=1# | 2 | 464
5 | ((\(-2\),), (2,), (4, 9, 14)) | " and 1=1# | 0 | 464
6 | ((\(-2\),), (0, 2), (4, 9, 14)) | " or ASCII(Substr((select @@version),1,1))\(>=\)64# | 38 | 502
7 | ((\(-2\),), (0, 2), (4, 9, 14, 38)) | "; select @@version;# | 12 | \(-2\)
8 | ((\(-2\), 12), (0, 2), (4, 9, 14, 38)) | " FINAL multi_stack FINAL | 19 | \(-1\)

Qn query number, An action number, m the environment response
Table 10 shows the trajectory of agent1 solving a stack-based exploitation in 8 steps. Surprisingly, when agent1 observes a new length after query number 4, it executes another exploratory query, " and 1=1#, with the same escape, as if to confirm its suspicions. This may be due to the relatively low cost of a query and the high cost of a mistaken exploit (and incomplete optimization). Next, in query number 6, it performs a single Boolean-based blind test, similar to what it does in the seven-query trajectory of Table 4. After this, it checks for the version number and successfully executes the exploit.

Simulation3: failed trajectories with no traffic

Table 11 shows one of the two trajectories where agent3 failed to identify a Boolean-based blind vulnerability.
Table 11
Action trajectory of agent3 failing to solve a Boolean-based blind vulnerability

Qn | Action | Action nr | m
0 | or 1=1# | 10 | (380, slow)
1 | ’ or 1=1# | 6 | (230, fast)
2 | ’; select @@version;# | 14 | (230, fast)
3 | ’ and 1=2# | 5 | (230, fast)
4 | ; select @@version;# | 16 | (230, fast)
5 | and 1=1# | 8 | (654, slow)
6 | and ASCII(Substr((select @@version),1,1))\(>=\)64# | 48 | (230, fast)
7 | union select (select @@version),2,3# | 32 | (230, fast)
8 | union select (select @@version)# | 30 | (230, fast)
9 | Exploit union rows:2 FINAL | 34 | (15, slow)

Qn query number, Action nr action number, m the environment response
Table 12
Action trajectory of agent3 failing to solve Boolean-based blind

Qn | Action | Action nr | Response (m)
0 | or 1=1# | 10 | (380, slow)
1 | ’ or 1=1# | 6 | (230, fast)
2 | ’; select @@version;# | 14 | (230, fast)
3 | ’ and 1=2# | 5 | (230, fast)
4 | ; select @@version;# | 16 | (230, fast)
5 | and 1=1# | 8 | (654, slow)
6 | and ASCII(Substr((select @@version),1,1))\(>=\)64# | 48 | (230, fast)
7 | union select (select @@version),2,3# | 32 | (230, fast)
8 | union select (select @@version)# | 30 | (230, fast)
9 | Exploit union rows:2 FINAL | 34 | (15, slow)

Qn query number, Action nr action number, m the environment response
From this we see that the agent mistakes a Boolean-based blind vulnerability for a union-based vulnerability. Analysing this mistake, we see that after query 6 agent3 is clearly biased towards a union-based vulnerability, and after determining that the number of rows is neither 1 nor 3, its remaining option is 2 rows. However, in this case the agent has been tricked, as in the trajectory shown in Table 6: the length of the HTML response is the same as when no SQL injection occurs, and the response is also fast, since the first character of the SQL version string has an ASCII value smaller than 64, resulting in a very fast lookup.
The second failure case for agent3 was again a Boolean-based blind vulnerability, mistaken this time for a time-based blind vulnerability. This is a more natural misunderstanding, as Boolean-based blind vulnerabilities can typically also be solved in a time-based blind way. The mistake came from an analogous server behavior: the HTML response to the Boolean-based blind probing query has the same length as the non-leaking response, and the probing response is fast. The full trajectory is shown in Table 12.

Simulation3t: number of queries by vulnerability

Figure 9 shows the number of actions required for each of the different exploits when the traffic is 5%. The presence of traffic seems to make the solution of the challenges more uncertain, leading to further overlap in how the problems are solved: union-, stack-, Boolean-, and time-based vulnerabilities all show overlapping distributions.

Computational cost of training

The computational cost of training our agents (agent1, agent2, agent3, and agent3t) correlates strongly with the number of training episodes. However, an episode consists of anywhere between one and one hundred actions: in the extreme case, a million episodes could consist of 100 million actions, but could also consist of just one million actions. We log the number of actions the different agents send during training; the longest and shortest episodes occur at the beginning of training. The final number of actions taken during training is shown in Table 13. We see that computational cost correlates with the number of episodes and the complexity of the simulation. Simulation1 and Simulation2 both trained the agents for one million episodes; however, agent2, which also tackles error-based vulnerabilities, had a higher computational cost than agent1. This increase occurs even though agent2 uses less memory than agent1 by exploring fewer states; it is primarily due to Simulation2 requiring the agent to also account for error-based vulnerabilities, which increases the number of actions agent2 takes. For Simulation3, we have ten times as many episodes, dramatically increasing the computational cost. In addition to the increased number of episodes, time-based blind vulnerabilities add considerable complexity, which is also reflected in the computational cost. Finally, the addition of traffic adds a cost increase of a similar proportion to that observed when adding error-based vulnerabilities.
Table 13
Number of actions taken by the agents during training

Agent | Number of episodes (\(10^6\)) | Number of actions (\(10^6\))
agent1 | 1 | 6.61
agent2 | 1 | 7.06
agent3 | 10 | 75.54
agent3t | 10 | 80.02
Footnotes
1. In other literature, this hidden query may also be referred to as first query, pre-generated query, or server-side query.
2. Our simulation considers only one web parameter.
3. In our simulation we do not simulate an entire web page, but instead return a simulated length between 0 and 1000.
4. In this implementation, the agent primarily uses this to distinguish between a positive and a negative result, but the distinction may be useful for future more complex attacks.
5. It is easier as the agent receives immediate feedback in the form of an error.
6. These two failures can be found in the “Appendix”.