1 Introduction
2 Related Work
3 Text-to-SQL Approaches
3.1 Text-to-SQL Method 1: SQLNet
SQLNet (Xu et al. 2017) predicts the slot content within a SELECT clause and the slot content within a WHERE clause of a query, making use of two main techniques: sequence-to-set prediction and column attention. The sequence-to-set approach predicts which column names appear in a subset of interest by computing probabilities given a column name and an utterance. Column attention is utilized to predict a specific column based on the embedding of an utterance. Before the prediction of the slot content, utterances and column names are first represented as sequences of tokens. Each token is enumerated and then encoded via a GloVe (Pennington et al. 2014) word embedding. GloVe is an unsupervised learning algorithm used to obtain vector representations of words. Afterwards, using the sequence-to-set technique, the authors first predict a set of columns for the SELECT and WHERE clauses. A SELECT clause may consist of column names and aggregate operators, while the WHERE clause may consist of a set of columns, operators, and values. After a column is selected for the SELECT clause, a probability that this column is preceded by an aggregate operator is computed. Based on the predicted columns in the WHERE clause, predictions about the operator(s) and value(s) in it follow. Xu et al. consider the prediction of the number of columns in the WHERE clause as an (N + 1)-way classification problem, while the prediction of the OP slot is considered a 3-way classification, since an operator can be one of the following symbols: \(>, <, =\). Furthermore, the authors provide a way to derive possible cell values from an utterance by employing a sequence-to-sequence structure consisting of a biLSTM (Schuster and Paliwal 1997) encoder and a pointer network (Vinyals et al. 2015) decoder that uses a column attention mechanism. A biLSTM encoder contains two LSTM neural networks that process the input in opposite directions. Pointer networks can compute the probability that a token from an input sequence, i.e., the utterance, is part of the output sequence, i.e., a value in the SQL query. Initially, SQLNet was trained on the WikiSQL dataset. For our evaluation, we used the version adapted to the Spider dataset (SQLNetSpider version 2023) and trained it on Spider (Yu et al. 2018).
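To make these two techniques concrete, the following is a minimal PyTorch sketch of sequence-to-set prediction with column attention: each column attends over the encoded question and receives an independent probability of appearing in a clause. The class name, layer shapes, and scoring head are illustrative assumptions, not SQLNet's actual architecture.

```python
import torch
import torch.nn as nn

class SeqToSetColumnPredictor(nn.Module):
    """Sequence-to-set prediction with column attention: every column
    attends over the question encoding and gets an independent
    probability of appearing in the SELECT or WHERE clause."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.attn = nn.Linear(hidden_size, hidden_size)  # column-attention projection
        self.score = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, question_enc: torch.Tensor, column_enc: torch.Tensor) -> torch.Tensor:
        # question_enc: (question_len, hidden) biLSTM states of the utterance
        # column_enc:   (num_columns, hidden) one encoding per column name
        scores = column_enc @ self.attn(question_enc).T    # (num_columns, question_len)
        weights = torch.softmax(scores, dim=1)             # per-column attention over tokens
        context = weights @ question_enc                   # (num_columns, hidden)
        logits = self.score(torch.cat([column_enc, context], dim=1))
        return torch.sigmoid(logits).squeeze(1)            # P(column in clause), per column

# Usage: probs = SeqToSetColumnPredictor(64)(torch.randn(12, 64), torch.randn(5, 64))
```

Because each column receives an independent sigmoid probability rather than competing in one softmax, the model predicts a set of columns instead of a fixed-length sequence, which is the core of the sequence-to-set idea.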
3.2 Text-to-SQL Method 2: RatSQL
To link the utterance to the database schema, RatSQL's relation-aware encoder uses, among others, the relation types QUESTION-COLUMN-M, QUESTION-TABLE-M, COLUMN-QUESTION-M, and TABLE-QUESTION-M. Here, M specifies whether there is an exact match, partial match, or no match relation between n-grams of the utterance and names of columns or tables in the database schema. The output of the encoder is a joint representation of the column, table, and question. Value matching can be challenging, as it requires access to the database content to correctly link values from an utterance to a column of the database schema not mentioned in the utterance. The authors propose value-based linking without exposing the model to the whole data of a database by adding a Column-Value relation between any word and column name if the question word occurs as a value (exact or partial) of that column. The RAT-SQL decoder generates the SQL program as an abstract syntax tree (Yin and Neubig 2017); an LSTM then outputs one of the following decoder actions: ApplyRule (expands the last generated node into a grammar rule), SelectColumn (selects a column from the DB schema), or SelectTable (selects a table from the DB schema). Following the original publication, we evaluate RatSQL with a pre-trained GloVe embedding.
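As an illustration of the name-based part of this schema linking, the sketch below tags an exact/partial/no-match relation between each question token and each schema name. It operates on single tokens only, whereas RAT-SQL matches n-grams against names; the function and label names are our own, not the authors' implementation.

```python
def link_question_to_schema(question_tokens, schema_names):
    """Tag an EXACT-MATCH / PARTIAL-MATCH / NO-MATCH relation between each
    question token and each schema name (column or table name).
    Simplified, token-level sketch of RAT-SQL-style name-based linking."""
    relations = [["NO-MATCH"] * len(schema_names) for _ in question_tokens]
    for j, name in enumerate(schema_names):
        name_tokens = name.lower().split()
        for i, token in enumerate(question_tokens):
            t = token.lower()
            if [t] == name_tokens:            # token equals the full schema name
                relations[i][j] = "EXACT-MATCH"
            elif t in name_tokens:            # token appears inside the name
                relations[i][j] = "PARTIAL-MATCH"
    return relations

# e.g. link_question_to_schema(["which", "fix", "version", "?"], ["fix version", "status"])
# tags "fix" and "version" as PARTIAL-MATCH for the column "fix version".
```

These match labels are what the M placeholder in the relation types above stands for; the encoder consumes them as edge labels in its relation-aware self-attention.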
3.3 Text-to-SQL Method 3: LGESQL
3.4 Text-to-SQL Method 4: SmBop+GraPPa
3.5 Text-to-SQL Method 5: T5 + PICARD
3.6 Comparison of the Text-to-SQL Approaches
Model | Submitted to Spider leaderboard | EM acc. (dev) [%] | EM acc. (test) [%]
---|---|---|---
SQLNet | Sep 2018 | 10.9 | 12.4
RatSQL (GloVe) | Dec 2019 | 62.7 | 57.2
LGESQL (GloVe) | Feb 2021 | 67.6 | 62.8
SmBoP+GraPPa | Mar 2021 | 74.7 | 69.5
T5 + PICARD | May 2022 | 77.2 | 72.4
Criterion | SQLNet | RatSQL | LGESQL | SmBoP + GraPPa | T5 + PICARD
---|---|---|---|---|---
Utterance and schema representation | as sequence of tokens | as question-contextualized schema graph | as question-schema graph | as question-contextualized schema graph | as sequence of tokens (sentence-piece tokenization)
Input encoding | GloVe | GloVe, BERT | GloVe, BERT, ELECTRA | GraPPa | token IDs; relative positional encoding
SQL generation strategy | sketch aligning to SQL grammar | SQL as AST + context-free grammar | SQL as AST + graph pruning | based on semi-autoregressive bottom-up parser | autoregressive decoder with multi-head attention and constrained output during beam search
Key techniques | slot prediction, sequence-to-set prediction, column attention | encoder-decoder, relational-aware self-attention | encoder-decoder, L dual relational graph attention network layers | RatSQL + GraPPa encoder, semi-autoregressive decoder, cell values via prob. computation | self-attention, transfer learning, constrained decoder, cell values similar to bridging
Dataset | # utterances | # queries | # DBs | # domains | # tables per DB | # ORDER BY | # GROUP BY | # nested queries | # HAVING |
---|---|---|---|---|---|---|---|---|---|
Spider (Yu et al. 2018) | |||||||||
– all subsets | 10,181 | 5,693 | 200 | 138 | 5.1 | 1,335 | 1,491 | 844 | 388 |
– software engineering subset | 49 | 24 | 1 | 1 | 6 | 12 | 8 | 12 | 0 |
SEOSS-Queries | 1,162 | 166 | 1 | 1 | 13 | 16 | 49 | 15 | 8 |
4 Datasets
4.1 The Spider Dataset
Queries containing more SQL keywords, such as GROUP BY, JOIN, LIMIT, and ORDER BY, are considered harder to predict (Yu et al. 2018). By splitting the SQL queries into different hardness levels, Yu et al. provide a way to better understand a model's performance with regard to query complexity. In comparison to another large and popular text-to-SQL dataset, WikiSQL (Zhong et al. 2017), Spider contains more complex SQL queries covering SQL operators such as JOIN and GROUP BY (Yu et al. 2018). For these reasons, we chose Spider as the training set in our transfer learning experiments. We also chose Spider because WikiSQL restricts query generation to a single table, meaning that predicted queries merely consist of the SELECT and the WHERE clause. In the text-to-SQL task we envision and study within the SE domain, querying solely one table when gathering and relating data from different software tools is unrealistic.
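For intuition on the hardness levels mentioned above, such bucketing can be approximated by counting SQL components, as in the sketch below. The official Spider evaluation script counts components of the parsed query and applies finer-grained rules; this string-based heuristic and its thresholds are simplifications.

```python
def spider_hardness(sql: str) -> str:
    """Approximate Spider-style hardness bucketing by counting SQL
    components in the query string (illustrative heuristic only)."""
    q = sql.upper()
    components = sum(q.count(kw) for kw in (
        "JOIN", "GROUP BY", "ORDER BY", "HAVING", "LIMIT",
        "INTERSECT", "UNION", "EXCEPT"))
    components += q.count("SELECT") - 1   # additional SELECTs indicate nesting
    if components == 0:
        return "easy"
    if components == 1:
        return "medium"
    if components <= 3:
        return "hard"
    return "extra hard"

# spider_hardness("SELECT name FROM issue")                             -> "easy"
# spider_hardness("SELECT status, COUNT(*) FROM issue GROUP BY status") -> "medium"
```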
4.2 The SEOSS-Queries Dataset
Problematic SQL query | Incompatibility
---|---
SELECT CASE WHEN EXISTS (SELECT * FROM change_set_link WHERE issue_id = 'PIG-4092' AND commit_hash = '[...]') THEN 'True' ELSE 'False' END | CASE statement
SELECT issue_id FROM issue WHERE strftime('%Y-%m-%d', created_date) = DATE('now','-1 day') | strftime() function
SELECT * FROM issue AS T1 WHERE T1.issue_id NOT IN (SELECT T2.issue_id FROM issue_attachment AS T2) | NOT in WHERE clause
SELECT * FROM issue WHERE type = 'Bug' AND (status = 'Resolved' OR status = 'Closed') AND (resolution = 'Fixed' OR resolution = 'Done') | brackets in WHERE clauses
SELECT T1.* FROM issue AS T1 JOIN issue_fix_version AS T2 ON T1.issue_id = T2.issue_id WHERE T2.fix_version = '0.12.1' | T1.* in SELECT
SELECT AVG(count_per_issue) FROM (SELECT Count(DISTINCT username) AS count_per_issue FROM issue_comment GROUP BY issue_id) | keyword AS in SELECT
SELECT Count(*) FROM issue WHERE description IS NULL | NULL in WHERE clause
SQL queries containing functions such as strftime or specific SQL keywords such as NOT or CASE could not be processed, returning an exception. Further examples of incompatible queries are shown in Table 4. Aiming for comparability with Spider's assessment, we decided not to adapt the evaluation script but to remove incompatible queries, resulting in 931 utterances and 133 SQL queries for performing our experiments. These 133 queries are distributed across the four complexity levels as follows: 56 easy, 54 medium, 11 hard, and 12 extra hard. Figure 4 shows, per hardness level, an example of an utterance and its corresponding SQL query.
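To illustrate how such incompatible queries can be screened out before evaluation, here is a keyword-based pre-filter sketch. The marker list mirrors the incompatibilities we observed (cp. Table 4) but is illustrative rather than exhaustive, and the actual evaluation script fails during parsing rather than on keyword occurrence.

```python
# Markers that made the Spider evaluation script fail on SEOSS-Queries
# (cp. Table 4). Illustrative subset, not an exhaustive list.
INCOMPATIBLE_MARKERS = ("CASE", "STRFTIME", "NOT IN", "IS NULL")

def is_evaluable(sql: str) -> bool:
    """Heuristic pre-filter for queries the evaluation script can handle."""
    q = sql.upper()
    return not any(marker in q for marker in INCOMPATIBLE_MARKERS)

# e.g. is_evaluable("SELECT Count(*) FROM issue WHERE description IS NULL") -> False
```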
5 Evaluation of Text-to-SQL Approaches for SE Tasks
5.1 Quantitative Evaluation of Text-to-SQL Approaches
 | Experiment 1 | | Experiment 2 | | Experiment 3 |
utterances | Train (Sp. + SQ.) | Test (SQ.) | Train (Sp. + SQ.) | Test (SQ.) | Train (Sp. + SQ.) | Test (SQ.)
---|---|---|---|---|---|---
specific | 8659 + 0 | 133 | 8659 + 532 | 133 | 8659 + 742 | 108
non-specific | 8659 + 0 | 133 | 8659 + 532 | 133 | 8659 + 742 | 81
all | 8659 + 0 | 266 | 8659 + 532 | 266 | 8659 + 742 | 189
5.2 Survey on Usefulness of Text-to-SQL Approaches in the SE Domain
6 Evaluation Results and Discussion
 | Experiment 1 | Experiment 2 | Experiment 3
 | (Spider training) | (SEOSS-Queries training) | (SEOSS-Queries training)
 | EM accuracy \(\uparrow \) | EM accuracy \(\uparrow \) | EM accuracy \(\uparrow \)
Model | easy | med | hard | Xhard | all | easy | med | hard | Xhard | all | easy | med | hard | Xhard | all
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
specific utterances | |||||||||||||||
count | 56 | 54 | 11 | 12 | 133 | 56 | 54 | 11 | 12 | 133 | 20 | 56 | 12 | 20 | 108 |
SQLNet | 0.071 | 0.000 | 0.000 | 0.000 | 0.030 | 0.464 | 0.093 | 0.182 | 0.000 | 0.248 | 0.000 | 0.018 | 0.083 | 0.000 | 0.019 |
RatSQL | 0.375 | 0.167 | 0.000 | 0.000 | 0.226 | 0.893 | 0.907 | 0.545 | 0.417 | 0.827 | 0.900 | 0.482 | 0.583 | 0.200 | 0.519 |
LGESQL | 0.554 | 0.352 | 0.182 | 0.083 | 0.398 | 0.929 | 1.000 | 0.818 | 1.000 | 0.955 | 0.950 | 0.554 | 0.833 | 0.200 | 0.593 |
SmBoP+GraPPa | 0.536 | 0.481 | 0.182 | 0.083 | 0.444 | 0.821 | 0.852 | 0.636 | 0.333 | 0.774 | 1.000 | 0.518 | 0.667 | 0.200 | 0.565 |
T5 + PICARD | 0.696 | 0.556 | 0.273 | 0.182 | 0.561 | 0.964 | 0.981 | 0.909 | 1.000 | 0.970 | 0.950 | 0.661 | 0.667 | 0.250 | 0.639 |
non-specific utterances | |||||||||||||||
count | 56 | 54 | 11 | 12 | 133 | 56 | 54 | 11 | 12 | 133 | 15 | 42 | 9 | 15 | 81 |
SQLNet | 0.018 | 0.000 | 0.000 | 0.000 | 0.008 | 0.268 | 0.148 | 0.182 | 0.000 | 0.188 | 0.067 | 0.000 | 0.222 | 0.000 | 0.037 |
RatSQL | 0.143 | 0.167 | 0.091 | 0.000 | 0.135 | 0.839 | 0.704 | 0.636 | 0.250 | 0.714 | 0.533 | 0.190 | 0.667 | 0.067 | 0.284 |
LGESQL | 0.286 | 0.204 | 0.182 | 0.000 | 0.218 | 0.875 | 0.852 | 1.000 | 0.833 | 0.872 | 1.000 | 0.405 | 0.556 | 0.000 | 0.457 |
SmBoP+GraPPa | 0.375 | 0.222 | 0.091 | 0.000 | 0.256 | 0.750 | 0.611 | 0.455 | 0.000 | 0.602 | 0.667 | 0.190 | 0.556 | 0.000 | 0.284 |
T5 + PICARD | 0.518 | 0.315 | 0.182 | 0.000 | 0.364 | 0.893 | 0.926 | 1.000 | 0.917 | 0.917 | 0.667 | 0.381 | 0.667 | 0.200 | 0.432 |
all utterances | |||||||||||||||
count | 112 | 108 | 22 | 24 | 266 | 112 | 108 | 22 | 24 | 266 | 35 | 98 | 21 | 35 | 189 |
SQLNet | 0.045 | 0.000 | 0.000 | 0.000 | 0.019 | 0.366 | 0.120 | 0.182 | 0.000 | 0.218 | 0.029 | 0.010 | 0.143 | 0.000 | 0.026 |
RatSQL | 0.259 | 0.167 | 0.045 | 0.000 | 0.180 | 0.866 | 0.806 | 0.591 | 0.333 | 0.771 | 0.743 | 0.357 | 0.619 | 0.143 | 0.418 |
LGESQL | 0.420 | 0.278 | 0.182 | 0.042 | 0.308 | 0.902 | 0.926 | 0.909 | 0.917 | 0.914 | 0.971 | 0.490 | 0.714 | 0.114 | 0.534 |
SmBoP+GraPPa | 0.455 | 0.352 | 0.136 | 0.042 | 0.350 | 0.786 | 0.731 | 0.545 | 0.167 | 0.688 | 0.857 | 0.378 | 0.619 | 0.114 | 0.444 |
T5 + PICARD | 0.607 | 0.426 | 0.227 | 0.087 | 0.457 | 0.929 | 0.954 | 0.955 | 0.958 | 0.944 | 0.829 | 0.541 | 0.667 | 0.229 | 0.550 |
6.1 Utilizing Text-to-SQL Out of the Box for SE Tasks (Experiment 1)
HAVING clause. SQL queries marked as extra hard were the hardest to predict, and only a very small share of them were correctly predicted by LGESQL, SmBoP+GraPPa, and T5 + PICARD (cp. Table 6, specific utterances). Non-specific utterances of hardness level extra hard (Xhard) could not be predicted by any approach, suggesting that missing information, such as column and table names, combined with training without the SEOSS-Queries DB schema, makes it much more difficult for the text-to-SQL approaches to link DB values to the corresponding tables and columns.

6.2 Text-to-SQL Specifically Trained for SE Needs (Experiment 2)
6.3 Predicting Non-Trained SE Utterance-Query Pairs (Experiment 3)
 | Experiment 2 (p-value: 0.377) | | Experiment 3 (p-value: 0.766) |
LGESQL / T5+PICARD | True | False | True | False
---|---|---|---|---
True | 228 | 13 | 80 | 21
False | 19 | 6 | 24 | 64
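The p-values shown are consistent with a two-sided exact McNemar test on the discordant pairs, i.e., utterances for which exactly one of the two models produced a correct query. A minimal sketch, assuming the discordant counts from the table above (13/19 for Experiment 2, 21/24 for Experiment 3):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant counts b and c
    (cases where exactly one of the two compared models is correct)."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)  # cap at 1 when b == c

print(round(mcnemar_exact(13, 19), 3))  # Experiment 2 -> 0.377
print(round(mcnemar_exact(21, 24), 3))  # Experiment 3 -> 0.766
```

The concordant cells (both correct or both wrong) do not enter the test; only the disagreements between LGESQL and T5 + PICARD carry information about which model is stronger.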
6.4 Large Language Models as an Alternative to Text-to-SQL Models
 | SEOSS-Queries Utterance-Based Split | SEOSS-Queries Query-Based Split
 | EM accuracy \(\uparrow \) | EM accuracy \(\uparrow \)
Model | easy | med. | hard | Xhard | all | easy | med. | hard | Xhard | all
---|---|---|---|---|---|---|---|---|---|---
specific utterances | ||||||||||
count | 56 | 54 | 11 | 12 | 133 | 20 | 56 | 12 | 20 | 108 |
ChatGPT | 0.643 | 0.500 | 0.000 | 0.083 | 0.481 | 0.750 | 0.071 | 0.000 | 0.200 | 0.213 |
GitHub Copilot | 0.411 | 0.241 | 0.000 | 0.167 | 0.286 | 0.950 | 0.143 | 0.000 | 0.000 | 0.250 |
T5 + PICARD | 0.964 | 0.981 | 0.909 | 1.000 | 0.970 | 0.950 | 0.661 | 0.667 | 0.250 | 0.639 |
non-specific utterances | ||||||||||
count | 56 | 54 | 11 | 12 | 133 | 15 | 42 | 9 | 15 | 81 |
ChatGPT | 0.554 | 0.204 | 0.000 | 0.167 | 0.331 | 0.267 | 0.071 | 0.000 | 0.000 | 0.086 |
GitHub Copilot | 0.429 | 0.278 | 0.000 | 0.000 | 0.293 | 0.400 | 0.167 | 0.000 | 0.000 | 0.160 |
T5 + PICARD | 0.893 | 0.926 | 1.000 | 0.917 | 0.917 | 0.667 | 0.381 | 0.667 | 0.200 | 0.432 |
all utterances | ||||||||||
count | 112 | 108 | 22 | 24 | 266 | 35 | 98 | 21 | 35 | 189 |
ChatGPT | 0.598 | 0.352 | 0.000 | 0.125 | 0.406 | 0.543 | 0.071 | 0.000 | 0.114 | 0.159 |
GitHub Copilot | 0.420 | 0.259 | 0.000 | 0.083 | 0.289 | 0.714 | 0.153 | 0.000 | 0.000 | 0.212 |
T5 + PICARD | 0.929 | 0.954 | 0.955 | 0.958 | 0.944 | 0.829 | 0.541 | 0.667 | 0.229 | 0.550 |
6.5 Understanding SE Users' Information Needs and Text-to-SQL Utility (Survey)
SELECT * FROM issue AS T1 WHERE T1.issue_id IN (SELECT T2.issue_id FROM issue_attachment AS T2)

was automatically assessed as hard, while a majority of participants rated it as medium. This tendency towards less complex judgements is interesting, since we also assessed participants' capabilities in creating queries below and will report substantial deficiencies. We argue that query understanding differs from query creation in complexity, potentially explaining this divergence. Overall, even if queries were rated somewhat easier, participants mostly identified the same relative complexity of queries, lending the measure credibility.
WHERE clause (4%). Ultimately, only a minor group of 28% of the participants correctly constructed executable queries for the scenarios of complexity hard and extra hard. Queries were incomplete (32%), contained syntax errors (28%), or did not correctly reference the required database concepts (12%) (cp. Fig. 7). This success rate depends on participants' experience: 54% of the experienced participants, i.e., SQL experience level good and very good, wrote a correct query, compared to merely 7% of the less experienced ones, i.e., SQL experience level basic and middle. When asked which form of retrieval was more difficult to create for the easy and medium scenarios, participants responded: query (56%), utterance (3%), and both equally (36%). For the hard and extra hard scenarios, a majority (76%) considers query creation more difficult, while 4% consider the utterance more difficult to create and the remaining ones are indifferent. Based on the results of Experiment 2 (cp. Table 6), we selected the best performing text-to-SQL approach, T5 + PICARD, to validate utterances that we would have developed for the given scenarios and used them to predict queries. We assessed all predicted queries to be correct.

First, we asked participants to decide per presented SQL clause, e.g., SELECT and WHERE, whether it was syntactically correct, and found that none of them correctly differentiated all correct and incorrect clauses in the queries. Separated into two clusters, we found that 84% of the participants identified at least 60% of the incorrect clauses in queries, while the remaining 16% identified less than 30% of the mistaken SQL clauses. Second, we asked participants to decide per presented query whether it was complete and found that 12% of them correctly answered this question for all four queries, while 36% did not assess a single one correctly. Finally, we asked participants to decide per presented query whether it referred to all necessary column and table names. We found that 40% of the participants correctly solved this task for all four provided queries, while 12% did not correctly decide for a single one. These results suggest that in cases where column or table names are incorrect, users can still make use of the wrongly generated SQL queries. However, this is not the case when the SQL query is incomplete, e.g., has missing SQL clauses. In such cases, users are expected to need more time to complete the SQL. We argue, however, that this may still take SE experts less time than writing the SQL from scratch, since the result from text-to-SQL already focuses on the important tables and columns from the DB schema and partially generates the SQL.

7 Threats to Validity
Not all SQL queries, e.g., those containing functions such as strftime or specific SQL keywords such as NOT and CASE, from the SEOSS-Queries dataset could be processed, meaning that we could not use the entire SEOSS-Queries dataset for model evaluation. Nevertheless, without the incompatible SQL queries, we still had a substantial amount of data upon which to base our experiments. A concern may be expressed regarding the expressiveness of the SE dataset used in our study, considering that not all SQL grammar can be handled by the text-to-SQL approaches in our study. Since our main goal is to assess whether text-to-SQL approaches can be considered applicable and useful in the SE domain, we decided not to extend the currently accepted SQL grammar, so as to give a more realistic representation of the current capabilities of such approaches.