Abstract
Translating natural language to SQL (NL2SQL) has received extensive attention lately, especially with the recent success of deep learning technologies. However, despite the large number of studies, we do not have a thorough understanding of how good existing techniques really are and how applicable they are to real-world situations. A key difficulty is that different studies are based on different datasets, which often have their own limitations and assumptions that are left implicit in the context or in the datasets themselves. Moreover, the few evaluation metrics in common use are rather simplistic and do not properly reflect the accuracy of results, as will be shown in our experiments. To provide a holistic view of NL2SQL technologies and assess current advancements, we perform extensive experiments under our unified framework using eleven recent techniques over 10+ benchmarks, including a new benchmark (WTQ) and TPC-H. We provide a comprehensive survey of recent NL2SQL methods and introduce a taxonomy of them. We reveal major assumptions of the methods and classify translation errors through extensive experiments. We also provide a practical tool for validation that uses existing, mature database technologies such as query rewrite and database testing. We then suggest future research directions so that the translation can be used in practice.