
In-IDE Code Generation from Natural Language: Promise and Challenges

Published: 04 March 2022


Abstract

A great part of software development involves conceptualizing or communicating the underlying procedures and logic that need to be expressed in programs. One major difficulty of programming is turning concepts into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural language queries, but these have primarily been evaluated purely based on retrieval accuracy or overlap of generated code with developer-written code, and the actual effect of these methods on the developer workflow is surprisingly unattested. In this article, we perform the first comprehensive investigation of the promise and challenges of using such technology inside the PyCharm IDE, asking, “At the current state of technology does it improve developer productivity or accuracy, how does it affect the developer experience, and what are the remaining gaps and challenges?” To facilitate the study, we first develop a plugin for the PyCharm IDE that implements a hybrid of code generation and code retrieval functionality, and we orchestrate virtual environments to enable collection of many user events (e.g., web browsing, keystrokes, fine-grained code edits). We ask developers with various backgrounds to complete 14 Python programming tasks across 7 categories, ranging from basic file manipulation to machine learning and data visualization, with or without the help of the plugin. While qualitative surveys of developer experience are largely positive, quantitative results with regard to increased productivity, code quality, or program correctness are inconclusive. Further analysis identifies several pain points that could improve the effectiveness of future machine learning-based code generation/retrieval developer assistants and demonstrates when developers prefer code generation over code retrieval and vice versa. We release all data and software to pave the road for future empirical studies on this topic, as well as development of better code generation models.


1 INTRODUCTION

One of the major hurdles to programming is the time it takes to turn ideas into code [77]. All programmers, especially beginners but even experts, frequently reach points in a program where they understand conceptually what must be done next, but do not know how to create a concrete implementation of their idea or would rather not have to type it in if they can avoid it. The popularity of the Stack Overflow Q&A website is a great example of this need. Indeed, developers ask questions about how to transform ideas into code all the time, e.g., “How do I check whether a file exists without exceptions?,”1 “How can I merge two Python dictionaries in a single expression?,”2 and so on. Moreover, this need is likely to continue in the future, as new APIs appear continuously, and existing APIs change in non-backwards compatible ways [80], requiring recurring learning effort [57, 84].

Despite early skepticism towards the idea of “natural language programming” [26], researchers now widely agree on a range of scenarios where it can be useful to be able to formulate instructions using natural language and have the corresponding source code snippets automatically produced. For example, software developers can save keystrokes or avoid writing dull pieces of code [32, 86, 99, 115]; and non-programmers and practitioners in other fields, who require computation in their daily work, can get help with creating data manipulation scripts [38, 62].

Given a natural language query carrying the intent of a desired step in a program, there are two main classes of methods to obtain code implementing this intent, corresponding to two major research thrusts in this area. On the one hand, code retrieval techniques aim to search for and retrieve an existing code fragment in a code base; given the abundance of code snippets online, on platforms such as Stack Overflow, it is plausible that a lot of the code that one might write, especially for lower-level functionality and API usage primitives, already exists somewhere, therefore the main challenge is search. On the other hand, code generation techniques aim to synthesize code fragments given natural language descriptions of intent. This is typically a harder challenge than retrieval and therefore more ambitious, but it may be particularly useful in practice if those exact target code fragments do not exist anywhere yet and can be generated instead.

The early attempts at general-purpose code generation from natural language date back to the early to mid 2000s and resulted in groundbreaking but relatively constrained grammatical and template-based systems, e.g., converting English into Java [93] and Python [112]. Recent years have seen an increase in the scope and diversity of such programming assistance tools, as researchers have devised code generation techniques that promise to be more flexible and expressive using machine (deep) learning models trained on data from “Big Code” repositories such as GitHub and Stack Overflow; see Allamanis et al. [3] for an excellent survey of such techniques. Code retrieval systems have also improved dramatically in recent years, thanks to the increasing availability of source code online and more sophisticated information retrieval and machine learning techniques; perhaps the most popular current code retrieval system is Microsoft’s Bing Developer Assistant [115], which is an adaptation of the Bing search engine for code.

While both types of methods (generation and retrieval) for producing appropriate code given natural language intents have received significant interest in machine learning circles, there is a surprising paucity of research using human-centered approaches [83] to evaluate the usefulness and impact of these methods within the software development workflow. An important open question is to what extent the typically high accuracy scores obtained during automatic evaluations on benchmark datasets will translate to real-world usage scenarios, involving software developers completing actual programming tasks. The former does not guarantee the latter. For example, an empirical study on code migration by Tran et al. [110] showed that the BLEU [89] accuracy score commonly used in natural language machine translation has only weak correlation with the semantic correctness of the translated source code [110].

In this article, we take one step towards addressing this gap. We implemented two state-of-the-art systems for natural language to code (NL2Code) generation and retrieval as in-IDE developer assistants and carried out a controlled human study with 31 participants assigned to complete a range of Python programming tasks with and without the use of the two varieties of NL2Code assistance. Our results reveal that while participants in general enjoyed interacting with our IDE plugin and the two code generation and retrieval systems, surprisingly there were no statistically significant gains in any measurable outcome when using the plugin. That is, tasks with code fragments automatically generated or retrieved using our plugin were, on average, neither completed faster nor more correctly than tasks where participants did not use any NL2Code assistant. This indicates that despite impressive improvements in the intrinsic performance of code generation and retrieval models, there is a clear need to further improve the accuracy of code generation, and we may need to consider other extrinsic factors (such as providing documentation for the generated code) before such models can make sizable impact on the developer workflow.

In summary, the main contributions of this article are: (i) A hybrid code generation and code retrieval plugin for the Python PyCharm IDE, which takes as input natural language queries. (ii) A controlled user study with 31 participants observed across 7 types of programming tasks (14 concrete subtasks). (iii) An analysis of both quantitative and qualitative empirical data collected from the user study, revealing how developers interact with the NL2Code assistant and the assistant’s impact on developer productivity and code quality. (iv) A comparison of code snippets produced by the two models, generation versus retrieval. (v) An anonymized dataset of events from our instrumented IDE and virtual environment, capturing multiple aspects of developers’ activity during the programming tasks, including plugin queries and edits, web browsing activities, and code edits.


2 OVERVIEW OF OUR STUDY

The goal of our research is to elucidate to what extent and in what ways current natural language programming techniques for code generation and retrieval can be useful within the development workflow as NL2Code developer assistants. Our main interest is evaluating the usefulness in practice of state-of-the-art NL2Code generation systems, which have been receiving significant attention from researchers in recent years, but have so far only been evaluated on benchmark datasets using standard NLP metrics. However, as discussed above, code generation and code retrieval are closely related problems, with increasingly blurred lines between them; e.g., recent approaches to align natural language intents with their corresponding code snippets in Stack Overflow for retrieval purposes [122] use similar deep learning technology as some code generation techniques [123]. Therefore, it is important to also consider code retrieval systems when experimenting with and evaluating code generation systems.

Given this complementarity of the two tasks, we select as a representative example of state-of-the-art techniques for code generation the semantic parsing approach by Yin and Neubig [123]. In short, the approach is based on a tree-based neural network model that encodes natural language utterances and generates corresponding syntactically correct target code snippets; for example, the model can generate the Python code snippet “x.sort(reverse=True)” given the natural language input “sort list x in reverse order.” We chose the approach by Yin and Neubig [123] over similar approaches such as those of Iyer et al. [49] and Agashe et al. [1], as it is the most general purpose and most naturally comparable to code retrieval approaches; see Section 9 for a discussion. For code retrieval, the closest analogue is Microsoft’s proprietary Bing Developer Assistant [115], which takes English queries as input and returns existing matching code fragments from the Web using the Bing search engine. However, given the proprietary nature of this system, we build a custom Stack Overflow code search engine inspired by it rather than use the system itself.

We then designed and carried out the controlled human study summarized in Figure 1. First, we implement the two code generation and retrieval techniques as a custom plugin for the PyCharm3 IDE, which takes as input natural language text intents and displays as output the corresponding code snippets generated and retrieved by the respective underlying models. Second, we compile 14 representative Python programming tasks across 7 task categories with varying difficulty, ranging from basic Python to data science topics. Third, we recruit 31 participants with diverse experience in programming in Python and with the different task application domains. Then, using an instrumented virtual environment and our IDE plugin, we collect quantitative and qualitative data about task performance and subjective tool use from each participant, as well as over 170 person hours of telemetry data from the instrumented environment.

Fig. 1.

Fig. 1. Overview of our study.

Finally, we analyze these data to answer three research questions, as follows:

RQ \( _{\mathbf {1}} \). How does using an NL2Code developer assistant affect task completion time and program correctness? This research question investigates quantitative differences in outcome variables between tasks completed in the treatment and control conditions. To this end, we use the log data from our instrumented virtual environment to compute task completion times, and rubric-based manual scoring of the solutions submitted by study participants to evaluate program correctness. Then, we use multivariate mixed-effects regression modeling to analyze the data. We expect that, using the plugin, developers can complete tasks faster without compromising solution quality.

RQ \( _{\mathbf {2}} \). How do users query the NL2Code assistant, and how does that associate with their choice of generated vs. retrieved code? This research question investigates quantitatively three dimensions of the inputs and outputs of the NL2Code plugin. Again using log data from our instrumented virtual environment, we first model how the natural language input queries differ when study participants favor the code snippets returned by the code generation model over those returned by the code retrieval model. Second, we evaluate the quality of the natural language queries input by study participants in terms of their ability to be answered by an oracle (human expert), which is also important for the success of NL2Code systems in practice, in addition to the quality of the underlying code generation or retrieval systems. Third, we study how the length and the frequency of different types of tokens change after study participants edit the candidate code snippets returned by the NL2Code plugin, which could indicate ways in which even the chosen code snippets are still insufficient to address the users’ needs.

RQ \( _{\mathbf {3}} \). How do users perceive the usefulness of the in-IDE NL2Code developer assistant? Finally, this research question investigates qualitatively the experience of the study participants interacting with the NL2Code plugin and underlying code generation and retrieval models.

In the remainder of this article, Sections 3 and 4 describe our study setup in detail; then Sections 5–7 present our answers to the research questions; Section 8 discusses implications; and Section 9 discusses related work.

Following best practices for empirical software engineering research [107, 116], we make our study replicable, publishing our plugin prototype, instrumented virtual environment, data extraction and analysis scripts, and the obtained anonymized raw data; see the online appendices at https://github.com/neulab/tranX-plugin and https://github.com/neulab/tranX-study.


3 NL2CODE IDE PLUGIN DESIGN

We designed and built a joint NL2Code generation and retrieval plugin for PyCharm, a popular Python IDE. Our plugin is open source and available online.4 As mentioned above, the plugin takes as input an English query describing the user’s intent and gives as output a ranked list of the most relevant code snippets produced by each of the two underlying code generation and retrieval systems. Using IDE plugins to query Web resources such as Stack Overflow is expected to be less disruptive of developers’ productivity than using an external Web browser, since it reduces context switching [9, 91]. Moreover, there exist already a number of IDE plugins for Web/Stack Overflow search and code retrieval [17, 91, 98, 115], therefore the human-computer interaction modality should feel at least somewhat natural to study participants.

The Underlying Code Generation System. For code generation, we use the model by Xu et al. [117] (available online5), which is an improved version of the tree-based semantic parsing model by Yin and Neubig [124], further pre-trained on official API documentation in addition to the original training on Stack Overflow questions and answers. 6

This model reports state-of-the-art accuracy on the CoNaLa benchmark [122], a dataset of intent/code pairs mined from Stack Overflow and standardly used to evaluate code generation models. Accuracy is computed using the BLEU score [89], a standard metric in the NLP community that measures the token-level overlap between the generated code and a reference implementation. As discussed above, the BLEU score (like similar automated metrics) is typically not sufficiently sensitive to small lexical differences in token sequence that can greatly alter the semantics of the code [110], hence our current human-centered study. Still, qualitatively, the model appears to generate reasonable code fragments given short text inputs, as shown in Table 1. Note how the model generates syntactically correct code snippets by construction, identifies and incorporates a wide variety of API calls, and copies important information such as string literals and variable names from the input natural language intent, in contrast to the code retrieval results. When displaying multiple generation results in the plugin described below, these results are ordered by the conditional probability of the generated code given the input command.
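
As an illustration of the BLEU metric mentioned above (not the benchmark's actual evaluation script), token-level BLEU between a generated snippet and a reference can be computed roughly as follows; the whitespace-based tokenization here is a simplification of the benchmark's own code tokenizer:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Pre-tokenized code; the CoNaLa evaluation uses its own code tokenizer.
    reference = "df = df . drop ( df . columns [ [ 0 ] ] , axis = 1 )".split()
    candidate = "df . drop ( df . columns [ [ 0 ] ] )".split()

    # Token-level BLEU-4 with smoothing: higher n-gram overlap with the
    # reference yields a higher score, regardless of semantic equivalence.
    score = sentence_bleu([reference], candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 3))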

Table 1.
Open a file "f.txt" in write mode.
  ✓ f = open('f.txt', 'w')
  ♣ f = open('f.txt', 'w')
  ♠ with open("users.txt", "a") as f: f.write(username + "\n")
Remove first column of dataframe df.
  ✓ df = df.drop(df.columns[[0]], axis=1)
  ♣ df.drop(df.columns[[0]])
  ♠ del df['column_name']
Lower a string text and remove non-alphanumeric characters aside from space.
  ✓ re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()
  ♣ re.sub(r'[^\sa-zA-Z0-9]', '', text)
  ♠ re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()

Table 1. Examples, where ✓ is the ground-truth code snippet, ♣ is the output from the state-of-the-art code generation model, and ♠ is the first candidate retrieved from Stack Overflow using Bing search

The Underlying Code Retrieval System. For code retrieval, similarly to a number of recent works on the subject [17, 91, 115], we implement a wrapper around a general-purpose search engine, specifically the Bing7 search engine. The wrapper queries this search engine for relevant questions on Stack Overflow,9 the dominant programming Q&A community, and extracts code from the returned pages. A dedicated search engine already incorporates advanced indexing and ranking mechanisms driven by user interaction data, and is therefore preferable to querying Stack Overflow’s internal search engine directly [115].

Specifically, we add the “Python” prefix to all user queries to confine the search to the Python programming language domain and add “site:stackoverflow.com” to confine the results to the Stack Overflow platform. We do not structurally alter the queries otherwise, e.g., we do not remove variables referenced therein, if any, although we do strip away grave accents that are part of the code generation model’s syntax.10 For the query example mentioned above, the actual query string for Bing search would become “Python reverse a list x site:stackoverflow.com.” For each Stack Overflow question page retrieved, we then extract the code snippets from the top three answers into a ranked list, sorted in descending order of upvotes. The code snippet extraction procedure follows Yin et al. [122] for identifying the code part of each answer, based on Stack Overflow-specific syntax highlighting and heuristics. When displaying multiple retrieval results, we preserve the order in which they appeared in the Bing search results, and answers within each Stack Overflow post are ordered by upvotes.
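
The query construction and snippet extraction steps can be sketched as follows. This is an illustration only, not the plugin's actual implementation: the helper names are ours, the CSS selectors are assumptions about Stack Overflow's page markup, and unlike the real pipeline this sketch neither calls the Bing API nor orders answers by upvotes.

    import requests
    from bs4 import BeautifulSoup

    def build_query(intent: str) -> str:
        # Confine the search to Python questions on Stack Overflow.
        return f"Python {intent} site:stackoverflow.com"

    def extract_snippets(question_url: str, top_answers: int = 3) -> list:
        # Pull code blocks from the first few answers of a Stack Overflow page.
        html = requests.get(question_url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        snippets = []
        for answer in soup.select("div.answer")[:top_answers]:
            for block in answer.select("pre code"):
                snippets.append(block.get_text())
        return snippets

    print(build_query("reverse a list x"))
    # Python reverse a list x site:stackoverflow.com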

Table 1 shows a few example outputs. Note how the retrieval results sometimes contain spurious code that is not part of the natural language intent (first example), and otherwise seem to complement the generation results. Indeed, in the second example the generation result is arguably closer to the desired answer than the retrieval result, with the opposite situation in the third example.

Interacting with the Plugin. Figure 2 illustrates the plugin’s user interface. The user first activates the query interface by pressing a keyboard shortcut while the cursor is in the IDE’s editor. A popup appears at the current cursor position (Figure 2(a)), and the user can enter a command in natural language that they would like to be realized in code (e.g., “reverse a list x”). The plugin then sends the request to the underlying code generation and code retrieval systems and displays a ranked list of results, with the top 7 code generation results at the top, followed by the top 7 code retrieval results (Figure 2(b)); 14 results are displayed in total.

Fig. 2.

Fig. 2. Screenshots of the in-IDE plugin taking a natural language query as input and listing code snippet candidates from both code generation and code retrieval.

The number 7 was chosen subjectively, aiming to maximize the amount and diversity of resulting code snippets while minimizing the screen space needed to display them and, therefore, the amount of scrolling expected from study participants looking to inspect all the plugin-returned results. After completing the current study, we found that the most relevant code snippets are typically within the top 3 results, and thus a smaller number of candidates may be sufficient. While the number and ordering of candidates have the potential to significantly impact the efficiency and efficacy of the developer assistant, a formal evaluation of this impact is beyond the scope of this work.

If a code snippet is selected, it is inserted at the current cursor position in the code editor. The user’s selection is also recorded by our instrumentation in the back end. Understandably, some returned code snippets may not be directly suitable for the context inside the editor, so the user is welcome (and encouraged by the instructions we give as part of our human study) to edit the auto-inserted code snippets to fit their specific intent. After the edit is done, the user is asked to upload their edits to our server, along with the context of the code, using a dedicated key combination or the IDE’s context menu. The process is illustrated in Figure 3. The edit data enable us to analyze how many and what kind of edits users need to make to transform the auto-generated code into code that is useful in their context.

Fig. 3.

Fig. 3. Screenshots of fixing small errors in the generated code and uploading the corrected snippet.


4 HUMAN STUDY DESIGN

Given our NL2Code joint code generation and retrieval IDE plugin above, we designed and carried out a human study with 31 participants assigned to complete a range of Python programming tasks in both control (no plugin) and treatment (plugin) conditions.

4.1 Task Design

To emulate real-world Python development activities, but also fit within the scope of a user study, we compiled a set of 14 reasonably sized Python programming tasks, organized into 7 categories (2 tasks per category) that span a range of difficulty levels and application domains.

We started by identifying representative task categories that many users would encounter in practice. To that end, we analyzed two sources. First, we manually reviewed all the Python programming courses listed on three popular coding education websites (Udacity,14 Codecademy,15 and Coursera16) to identify modules commonly taught across all websites that indicate common usage scenarios of the Python language. Second, we cross-checked if the previously identified use cases are well represented among frequently upvoted questions with the [python] tag on Stack Overflow, which would further indicate real programmer needs. By searching the category name, we found that each of our identified categories covers more than 300 questions with more than 10 upvotes on Stack Overflow. We iteratively discussed the emerging themes among the research team, refining or grouping as needed, until we arrived at a diverse but relatively small set of use cases, covering a wide range of skills a Python developer may need in practice.

In total, we identified seven categories of use cases, summarized in Table 2. For each of the 7 categories, we then designed two tasks covering use cases in the most highly upvoted questions on Stack Overflow. To this end, we searched Stack Overflow for the “python” keyword together with another keyword indicative of the task category (e.g., “python matplotlib,” “python pandas”), selected only questions that were asking how to do something (i.e., excluding questions that ask about features of the language or about how to install packages), and, after discussion among the research team, drafted and iteratively refined tasks that would each cover 3–5 of the most frequently upvoted questions.

Table 2.
Category              Task  Description
Basic Python          T1-1  Randomly generate and sort numbers and characters with dictionary
                      T1-2  Date & time format parsing and calculation with timezone
File                  T2-1  Read, manipulate, and output CSV files
                      T2-2  Text processing about encoding, newline styles, and whitespaces
OS                    T3-1  File and directory copying, name editing
                      T3-2  File system information aggregation
Web Scraping          T4-1  Parse URLs and specific text chunks from web page
                      T4-2  Extract table data and images from Wikipedia page
Web Server & Client   T5-1  Implement an HTTP server for querying and validating data
                      T5-2  Implement an HTTP client interacting with given blog post APIs
Data Analysis & ML    T6-1  Data analysis on automobile data of performance metrics and prices
                      T6-2  Train and evaluate a multi-class logistic regression model given dataset
Data Visualization    T7-1  Produce a scatter plot given specification and dataset
                      T7-2  Draw a figure with three grouped bar chart subplots aggregated from dataset

Table 2. Overview of Our 14 Python Programming Tasks

We illustrate this process with the following example task for the “Data visualization” category17:

By running python3 main.py, draw a scatter plot of the data in shampoo.csv and save it to shampoo.png. The plot size should be 10 inches wide and 6 inches high. The Date column is the x axis (some dates are missing from the data and in the plot the x axis should be completed with all missing dates without sales data). The date string shown on the plot should be in the format (YYYY-MM-DD). The Sales column is the y axis. The graph should have the title “Shampoo Sales Trend.” The font size of the title, axis labels, and x & y tick values should be 20pt, 16pt, and 12pt, respectively. The scatter points should be colored purple.

This task covers some of the top questions regarding data visualization with matplotlib found on Stack Overflow through the approach described above:

(1) How do you change the size of figures drawn with matplotlib?18
(2) How to put the legend out of the plot?19
(3) Save plot to image file instead of displaying it using Matplotlib?20
(4) How do I set the figure title and axes labels font size in Matplotlib?21

For each task designed, we also provide the user with required input data or directory structure for their program to work on, as well as example outputs (console print-outs, output files & directories, etc.) so they could verify their programs during the user study.

Table 2 summarizes the 14 tasks. The full task descriptions and input/output examples can be found online, as part of our replication package at https://github.com/neulab/tranx-study. The tasks have varying difficulties, and on average each task would take about 15–40 minutes to complete.

4.2 Participant Recruitment & Task Assignments

Aiming to recruit participants with diverse technical backgrounds but at least some programming experience and familiarity with Python to be able to complete the tasks, we advertised our study in two ways: (1) inside the university community through personal contacts, mailing lists, and Slack channels, hoping to recruit researchers and students in computer science or related areas; (2) on the freelancer platform Upwork,22 hoping to attract participants with software engineering and data science experience. We promised each participant US $5 per task as compensation; each participant was expected to complete multiple tasks.

To screen eligible applicants, we administered a pre-test survey to collect their self-reported levels of experience with Python and with each of the 7 specific task categories above; see Appendix B for the actual survey instrument. We only considered as eligible those applicants who reported at least some experience programming in Python, i.e., a score of 3 or higher given the answer range [1: very inexperienced] to [5: very experienced]; 64 applicants satisfied these criteria.

We then created personalized task assignments for each eligible applicant based on their self-reported levels of experience with the 7 specific task categories (see Appendix C for the distributions of participants’ self reported experience across tasks), using the following protocol:

(1) To keep the study relatively short, we only assign participants to a total of 4 task categories (8 specific tasks, 2 per category) out of the 7 possible.
(2) Since almost everyone eligible for the study reported being at least somewhat experienced with the first 2 task categories (Basic Python and File), we assigned everyone to these 2 categories (4 specific tasks total). Moreover, we assigned these 2 categories first and second, respectively.
(3) For the remaining 5 task categories, sorted in increasing complexity order,23 we rank them based on a participant’s self-reported experience with that task category, and then assign the participant to the top 2 task categories with most experience (another 4 specific tasks total).

Note that this filtering by experience is conducive to allowing participants to finish the tasks in a reasonable amount of time and reflective of a situation where a developer is working in their domain of expertise. However, at the same time it also means that different conclusions might be reached if novice programmers or programmers without domain expertise used the plugin instead.

Next, we randomly assigned the first task in a category to either the treatment condition, i.e., the NL2Code plugin is enabled in the virtual environment IDE and the participants are instructed to use it, or the control condition, i.e., the NL2Code plugin is disabled. The second task in the same category is then automatically assigned to the other condition, e.g., if the plugin is on for T1-1, then it is off for T1-2. Therefore, each participant was asked to complete 4 tasks out of the 8 total using the NL2Code plugin, and 4 without.
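
As an illustration of this counterbalancing, the following minimal sketch (with hypothetical category and task identifiers) randomly assigns one task per category to the plugin condition and the other to the control condition:

    import random

    categories = {
        "Basic Python": ["T1-1", "T1-2"],
        "File": ["T2-1", "T2-2"],
    }

    def assign_conditions(categories, seed=0):
        # For each category, randomly pick one task for the plugin (treatment)
        # condition; the remaining task gets the control condition.
        rng = random.Random(seed)
        assignment = {}
        for tasks in categories.values():
            treatment = rng.choice(tasks)
            for task in tasks:
                assignment[task] = "plugin" if task == treatment else "no plugin"
        return assignment

    print(assign_conditions(categories))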

Finally, we invited all eligible applicants to read the detailed study instructions, access the virtual environment, and start working on their assigned tasks. Only 31 out of the 64 eligible applicants after the pre-test survey actually completed their assigned tasks. 25 Their backgrounds were relatively diverse; of the 31 participants, 12 (39%) were software engineers and 11 (35%) were computer science students, with the rest being researchers (2, 6%), and other occupations (6, 19%). Our results below are based on the data from these 31 participants.

4.3 Controlled Environment

Participants worked on their assigned tasks inside a custom instrumented online virtual environment, accessible remotely. Our virtual machine is preconfigured with the PyCharm Community Edition IDE26 and the Firefox Web browser; and it has our NL2Code plugin either enabled or disabled inside the IDE, depending on the condition. See Appendix A for complete technical details.

In addition, the environment logs all of the user’s interactions with the plugin in the PyCharm IDE, including queries, candidate selections, and edits; all of the user’s fine-grained IDE editor activities; the user’s Web search/browsing activities inside Firefox; all other keystrokes inside the VM; and the source code for each one of the user’s completed tasks.

To get a sense of how the source code evolves, whenever the user does not make modifications to the code for at least 1.5 seconds, the plugin also automatically uploads the current snapshot of the code to our server. The intuition behind this heuristic is that after a user makes some type of meaningful edit, such as adding or modifying an argument, variable, or function, they usually pause for a short time before the next edit. This edit activity granularity can be more meaningful than keystroke/character level, and it is finer grained than intent level or commit level edits.
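
The 1.5-second quiescence heuristic amounts to a simple debounce timer. The sketch below is illustrative only (the plugin itself runs inside PyCharm); the upload_snapshot callback and the way edit events are delivered here are assumptions, not the plugin's actual implementation:

    import threading
    import time

    QUIET_PERIOD_S = 1.5  # upload after 1.5 s with no further edits

    class SnapshotUploader:
        def __init__(self, upload_snapshot):
            self._upload = upload_snapshot  # callback that sends code to the server
            self._timer = None

        def on_edit(self, current_code: str) -> None:
            # Called on every code modification; restart the quiet-period timer.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(QUIET_PERIOD_S, self._upload,
                                          args=[current_code])
            self._timer.daemon = True
            self._timer.start()

    uploader = SnapshotUploader(lambda code: print("uploading", len(code), "chars"))
    uploader.on_edit("print('hello')")
    time.sleep(2)  # give the timer a chance to fire in this demo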

Because participants’ contact information is identifiable, we record it (used only for compensation purposes) separately from their activity logs. This Human Subjects research protocol underwent review and was approved by the Carnegie Mellon University Institutional Review Board.

4.4 Data Collection

To answer our research questions (Section 2), we collect the following sets of data:

Task Performance Data (RQ\( _{{\bf 1}} \)). The first research question compares measurable properties of the tasks completed with and without the help of our NL2Code IDE plugin and its underlying code generation and code retrieval engines. One would expect that if such systems are useful in practice, then developers would be able to complete programming tasks faster without compromising on output quality. To investigate this, we measure two variables related to how well study participants completed their tasks and the quality of the code they produced:

  • Task Completion Time. Since all activity inside the controlled virtual environment is logged, including all keystrokes and mouse movements, we calculate the time interval between when a participant started working on a task (first keystroke inside the IDE) and when they uploaded their final submission to our server.

    Recall that participants worked asynchronously and may have taken breaks; we designed our virtual environment to account for this with explicit pause/resume functionality. To obtain more accurate estimates of time spent on task, we subtract the time intervals when participants used this pause/resume functionality, as well as all idle intervals in which participants had no mouse or keyboard activity for two minutes or more (they may have taken a break without recording it explicitly).

    Figure 4 shows the distributions of task completion times across the two conditions (with and without the plugin).

    Fig. 4. Distributions of task completion times (in seconds) across tasks and conditions (with and without the plugin). The horizontal dotted lines represent the 25% and 75% quartiles, and the dashed lines represent medians.

  • Task Correctness. Following the common practice in computer science education [18, 25, 36], we design a rubric for each task concurrently with designing the task and later score each submission according to that rubric. We weigh all tasks equally, assigning a maximum score of 10 points to each. For each task, the rubric covers both basic aspects (e.g., runs without errors/exceptions; produces the same output as the example output provided in the task description) as well as implementation details regarding functional correctness (e.g., considers edge cases, implements all required functionality in the task description).

    For example, for the data visualization task described in Section 4.1, we created the following rubric, with the number in parentheses representing the point value of an item, for a total of 10 points: (i) Runs without errors (2); (ii) Correct image output format (png) (2); (iii) Read in the raw data file in correct data structure (1); (iv) Correct plot size (1); (v) Correctly handle missing data points (1); (vi) Date (x axis) label in correct format (1); (vii) Title set correctly (1); (viii) Font size and color set according to specification (1).

    To reduce subjectivity, we graded each submission blindly (i.e., not knowing whether it came from the control or treatment condition) and we automated rubric items when possible, e.g., using input-output test cases for the deterministic tasks and checking whether the abstract syntax tree contains nodes corresponding to required types (data structures) such as dictionaries; a minimal sketch of such an AST check appears after Figure 5 below. See our online appendix27 for the complete rubrics and test cases for all tasks.

    Figure 5 shows the distributions of scores across tasks, between the two conditions.

    Fig. 5. Distributions of task correctness scores (0–10 scale) across tasks and conditions. The horizontal dotted lines represent the 25% and 75% quartiles, and the dashed lines represent medians.
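
As an illustration of the automated rubric checks, the following minimal sketch (the helper name uses_dictionary is ours; the actual grading scripts are in the online appendix) uses Python's ast module to test whether a submission creates a dictionary:

    import ast

    def uses_dictionary(source_code: str) -> bool:
        # True if the program contains a dict literal or dict comprehension node.
        tree = ast.parse(source_code)
        return any(isinstance(node, (ast.Dict, ast.DictComp))
                   for node in ast.walk(tree))

    submission = "word = 'banana'\ncounts = {c: word.count(c) for c in word}"
    print(uses_dictionary(submission))  # True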

Plugin Queries, Snippets, and User Edits (RQ\( _{{\bf 2}} \)). We record user queries using the plugin, both the generated and retrieved code snippet candidates returned for the query, and the user selection from the candidates to insert into their source code. We use the data to analyze the NL queries and whether users preferred to use generated vs. retrieved code. In addition, we also record the user edits after inserting the code snippet from the plugin, along with the code context for the analysis on post edits required after using the plugin.

Participant Perceptions of Tool Use (RQ\( _{{\bf 3}} \)). We ran short post-test surveys after every task and a final post-test survey at the end of the study as a whole (see Appendix D for instruments) to collect data on the participants’ subjective impressions of using the NL2Code plugin and interacting with the code generation and code retrieval systems. We asked Likert-style and open-ended questions about aspects of using the plugin the participants enjoyed and aspects they wish to see improved.

Next, we describe how we analyzed these data and we answer each of our research questions.


5 RQ\( _{{\bf 1}} \): NL2Code Plugin Effects on Task Completion Time and Program Correctness

We start by describing our shared data analysis methodology, applied similarly to both variables corresponding to RQ\( _{{\bf 1}} \), then present our results for each variable.

Methodology. Recall that we assign each participant a total of 8 tasks, 2 per task category, based on their experience levels with those categories; in each category, we randomly assign one of the 2 tasks to the NL2Code plugin (treatment) condition and the other task to the no plugin (control) condition. We then compute the two outcome variables above.

The key idea behind our analysis is to compare the distributions of outcome variables between tasks completed in the treatment and control conditions. However, this comparison is not straightforward. First, our study design imposes a hierarchical structure during data collection, therefore the individual observations are not independent—by construction, the same participant will have completed multiple tasks over the course of the study. Moreover, tasks vary in difficulty, again by construction, therefore it is expected that their corresponding response variables, e.g., task completion times, can be correlated with the tasks themselves; e.g., on average, more complex tasks will take longer to complete. Finally, the participants vary in their self reported levels of Python and individual task category experience; we should separate experience-related effects from effects of using the plugin, if any.

Therefore, we use mixed-effects [34] as opposed to the more common fixed-effects regression models to analyze our data. Fixed-effects models assume that residuals are independently and identically distributed, which is an invalid assumption in our case given the hierarchical nature of our data: E.g., responses for the different measurement occasions (tasks) within a given individual are likely correlated; a highly experienced Python programmer completing one task quickly is more likely to complete other tasks quickly as well. Mixed-effects models address this issue by having a residual term at each level, e.g., the observation level and the study participant level, in which case the individual participant-level residual is the so-called random effect. This partitions the unexplained residual variance into two components: higher-level variance between higher-level entities (study participants) and lower-level variance within these entities, between measurement occasions (tasks).

We consider two model specifications for each response variable. Our default model includes random effects for the individual and task, per the rationale above, a fixed effect for task category experience (e.g., participants with more machine learning experience should complete the machine learning task faster, on average), and a dummy variable to indicate the condition (plugin vs. no plugin). For example, for the task completion time response, we estimate the model:

(1) \( \texttt{completion\_time} = \texttt{experience} + \texttt{uses\_plugin} + (1 \vert \texttt{user}) + (1 \vert \texttt{task}). \)

As specified, our default model may suffer from heterogeneity bias [13]. Task category experience, a higher-level (i.e., individual-level as opposed to observation-level) predictor, varies both within and across study participants: Within participants, experience can vary across the 4 task categories—a user may be more experienced with basic Python than with data science; and across participants, experience with any given task category is likely to vary as well—some participants report higher experience with data science-related tasks than others. This means that experience (a fixed effect) and user (a random effect) may be “correlated.” In turn, this may result in biased estimates, because both the within- and between-effect are captured in one estimate.

There are two sources of variation that can be used to explain changes in the outcome: (1) overall, more experienced programmers may be more efficient at completing tasks (group-level pattern); and (2) when becoming more experienced, programmers may also become more efficient at completing tasks (individual-level pattern). Therefore, to address potential heterogeneity bias, we split our fixed effect (experience) into two variables, each representing a different source of variation: a participant’s average experience across all task categories (experience_btw) and the deviation for each task from the participant’s overall mean experience (experience_wi). This process is known as de-meaning or person-mean centering [34]. This way, mixed-effects models can model both within- and between-subject effects [13], as recommended for a long time by Mundlak [79]. Taking the same task completion time response variable as an example (other variables are modeled analogously), our refined model becomes:

(2) \( \texttt{completion\_time} = \texttt{experience\_btw} + \texttt{experience\_wi} + \texttt{uses\_plugin} + (1 \vert \texttt{user}) + (1 \vert \texttt{task}). \)

In both cases, the estimated coefficient for uses_plugin indicates the effect of using the plugin, while holding fixed the effects of experience and other random user and task effects.

For estimation, we used the lmer function (from the lme4 and lmerTest packages) in R. We follow the traditional level for statistical significance when interpreting coefficient estimates, i.e., \( p \lt 0.05 \). As indicators of goodness of fit, we report a marginal (\( R^2_m \)) and a conditional (\( R^2_c \)) coefficient of determination for generalized mixed-effects models [50, 85], as implemented in the MuMIn package in R: \( R^2_m \) describes the proportion of variance explained by the fixed effects alone; \( R^2_c \) describes the proportion of variance explained by the fixed and random effects together.
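
For readers who prefer Python, the de-meaning step and a simplified variant of model (1) can be sketched with pandas and statsmodels. This is an illustration only and not the analysis we ran: we fit crossed random intercepts for both user and task with lmer in R, whereas the sketch below only includes a random intercept per participant, and the file and column names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per (user, task) observation, with hypothetical columns:
    # user, task, experience, uses_plugin, completion_time.
    df = pd.read_csv("observations.csv")

    # Person-mean centering (de-meaning) of task-category experience.
    df["experience_btw"] = df.groupby("user")["experience"].transform("mean")
    df["experience_wi"] = df["experience"] - df["experience_btw"]

    # Fixed effects plus a random intercept per participant (our actual model
    # additionally includes a crossed random intercept per task).
    model = smf.mixedlm(
        "completion_time ~ experience_btw + experience_wi + uses_plugin",
        data=df,
        groups=df["user"],
    )
    print(model.fit().summary())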

Threats to Validity. Besides potential threats to statistical conclusion validity arising from the very nature of the data we are regressing over, discussed above and mitigated through our choice of mixed-effects regression models and their specific designs, we note the standard threats to statistical conclusion validity affecting linear regression models in general. To mitigate these, we take standard precautions. First, we removed as outliers the top 1% most extreme values. Second, we checked for collinearity among the predictors using the variance inflation factor (VIF) [22]; all values were below 3, i.e., multicollinearity is not an issue [58]. Finally, we acknowledge that additional time may be spent in the plugin condition because users are asked to upload their edits. However, the time spent uploading is minimal, as the plugin automatically helps the user remove the auto-generated comments with a single keyboard shortcut.

Results. Table 3 summarizes our default specification mixed-effects regressions for both response variables; the models with our second specification (de-meaned task experience) are equivalent (see Appendix G). All models include controls for the amount of users’ experience with the respective task categories as well as other random user and task effects. In all cases, the models fit the data reasonably well (ranging from \( R^2_c = 29\% \) for task correctness scores, to \( R^2_c = 64\% \) for task completion time), with most of the variance explained attributable to the two random effects (task and user)—there is significant user-to-user and task-to-task variability in all response variables.

Table 3.
                        Dependent variable:
                        Completion time     Correctness score
                        (1)                 (2)
Experience              -195.62             0.07
                        (183.11)            (0.24)
Uses plugin             15.76               0.44
                        (196.11)            (0.30)
Constant                3,984.51***         5.88***
                        (838.07)            (1.03)
Observations            224                 237
Num users               31                  31
Num tasks               14                  14
sd(user)                1,489.25            0.82
sd(task)                1,104.7             1.14
R2m                     0.004               0.008
R2c                     0.642               0.289
Akaike Inf. Crit.       3,987.14            1,106.66
Bayesian Inf. Crit.     4,007.61            1,127.46
Note: *p < 0.1; **p < 0.05; ***p < 0.01.

Table 3. LMER Task Performance Models (Default Specification)

Analyzing the models, we make the following observations: First, looking at the completion time model (1), there is no statistically significant difference between the two conditions. Stated differently, we do not find sufficient evidence to conclude that users in the plugin condition complete their tasks with different speed on average than users in the control group, contrary to our expectation.

Second, and this time in line with our expectation, there is no statistically significant difference between the two conditions in task correctness scores (model (2)). That is, the code written by users in the plugin condition appears statistically indistinguishable in correctness from the code written by users in the control group.

We investigate differences between the code written by study participants in the two conditions in more detail in the next section.


6 RQ\( _{{\bf 2}} \): Comparison of Generated vs. Retrieved Code

In this section, we focus on how study participants are interacting with the code generation and retrieval systems. Specifically, we dive deeper into both the inputs to and the outputs of the plugin, i.e., we analyze the quality of the queries issued by study participants and of the code snippets produced in return, contrasting code generation to retrieval throughout. We analyze these data along three dimensions, detailed next.

6.1 For What Queries Do Users Tend to Favor Generation vs. Retrieval Answers

First, we investigate whether there are any discernible characteristics of the natural language queries (and therefore tasks) that associate with study participants tending to favor the code snippets returned by the code generation model over those returned by the code retrieval model.

Methodology. Using our instrumented environment, we collect all successful queries issued by the study participants, i.e., those for which a code snippet from among the listed candidates was selected, and we record which of the two sources (generation or retrieval) the snippet came from. See Table 10 in Appendix H for the complete set of queries from our 31 participants, organized per task. We then build a binary logistic regression model with snippet source as outcome variable and bag-of-words features of the natural language input queries as predictors.

If this model is able to predict the source of the code snippet better than by chance, then we can conclude that there is some correlation between the type of input query and the users’ preference for generated versus retrieved code snippets. Moreover, the word feature weights in the logistic regression model could shed some light on what features are the most representative of queries that were effectively answered using generation or retrieval. For our analysis, we manually review the top 20 (approximately 7%) contributing query features for each value of the outcome variable (“generation” vs. “retrieval”) and discuss patterns we observe qualitatively, after thematic analysis.

Specifically, for each query, we tokenize it, filter out English stop words, and compute a bag-of-words and bag-of-bigrams vector representation, with each element of the vector corresponding to the number of times a particular word or bigram (two-word sequence) occurred in the query. The number of distinct words across all queries is 302 and the number of distinct bigrams is 491, thus the dimensionality of the query vector is 793. We then estimate the model:

(3) \( Pr(\text{chosen snippet is ``generated''}) = \frac{\exp ({\bf X}\beta)}{1+\exp ({\bf X}\beta)}, \)

where \( {\bf X} \) represents the k-dimensional bag-of-words vector representation of a given query, and \( \beta \) are the weights to be estimated. To this end, we randomly split all the collected query and candidate selection pairs into training (70% of the data) and held-out test (30%) sets. We then train the model using 5-fold cross-validation until it converges, and subsequently test it on the held-out set. We use 0.5 as a cutoff probability for our binary labels. In addition, we also build a trivial baseline model that always predicts “retrieval.”
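
A minimal sketch of this classification pipeline using scikit-learn is shown below. The queries and labels are placeholders, and the exact feature extraction and hyperparameters used in the study may differ:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.pipeline import make_pipeline

    # Placeholder data: natural language queries and the source of the chosen snippet.
    queries = [
        "open a csv file data.csv and read the data",
        "get date and time in gmt",
        "sort a list of numbers in reverse order",
        "read a text file line by line",
        "plot a grouped bar chart with matplotlib",
        "cross-validation with sklearn",
        "copy files to another directory",
        "extract table data from a wikipedia page",
    ]
    labels = ["generation"] * 4 + ["retrieval"] * 4

    # Bag-of-words and bag-of-bigrams features (English stop words removed),
    # followed by a binary logistic regression classifier.
    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), stop_words="english"),
        LogisticRegression(max_iter=1000),
    )

    X_train, X_test, y_train, y_test = train_test_split(
        queries, labels, test_size=0.25, random_state=0, stratify=labels)

    # The study uses 5-fold cross-validation on ~400 real queries; cv=2 here
    # only because the placeholder dataset is tiny.
    print(cross_val_score(pipeline, X_train, y_train, cv=2).mean())
    pipeline.fit(X_train, y_train)
    print(pipeline.score(X_test, y_test))  # accuracy on the held-out split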

The baseline model is 55.6% accurate (among the successful queries in our sample there are slightly more code snippets retrieved rather than generated). Our main logistic regression model is 65.9% accurate, i.e., the model was able to learn some patterns of differences between those queries that result in code generation results being chosen over code retrieval ones and vice versa.

Threats to Validity. One potentially confounding factor is that the plugin always displays code generation results first, before code retrieval. Ordering effects have been reported in other domains [102] and could also play a role here. Specifically, users who inspect query results linearly, top-down, would see the code generation results first and might select them more frequently than if the results were displayed in a different order. That is, we might infer that users prefer code generation to retrieval only because they see code generation results first, thus overestimating the users’ preference for code generation versus retrieval.

Even though testing ordering effects experimentally was not practical with our study design, we could test a proxy with our log data—to what extent the code generation results overlap with the code retrieval ones. High overlap could indicate that code retrieval results might have been chosen instead of code generation ones, if presented earlier in the candidates list. Whenever study participants chose a snippet returned by the code generation model, we compared (as strings) the chosen snippet to all candidates returned by the code retrieval engine. Only 6 out of 173 such unique queries (~3.5%) also contained the exact chosen code generation snippet among the code retrieval results; therefore, we conclude that this scenario is unlikely.
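
The overlap check itself is a straightforward pass over the interaction log; a minimal sketch with a hypothetical log format follows:

    def generation_retrieval_overlap(query_log):
        # query_log: iterable of dicts with hypothetical keys 'chosen_source',
        # 'chosen_snippet', and 'retrieval_candidates'.
        overlapping, total = 0, 0
        for entry in query_log:
            if entry["chosen_source"] != "generation":
                continue
            total += 1
            # Exact string comparison between the chosen generated snippet and
            # every snippet the retrieval engine returned for the same query.
            candidates = [c.strip() for c in entry["retrieval_candidates"]]
            if entry["chosen_snippet"].strip() in candidates:
                overlapping += 1
        return overlapping, total

    log = [{"chosen_source": "generation",
            "chosen_snippet": "x.sort(reverse=True)",
            "retrieval_candidates": ["sorted(x)[::-1]", "x.sort(reverse=True)"]}]
    print(generation_retrieval_overlap(log))  # (1, 1)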

Another potentially confounding factor is that an icon indicative of generation or retrieval is displayed next to each result in the plugin UI. This means that users know which model produced which candidate snippet and might choose a snippet because of that reason rather than because of the snippet’s inherent usefulness. More research is needed to test these effects. We hypothesize that biases may occur in both directions. On the one hand, holding other variables like ordering fixed, users might prefer code generation results because of novelty effects. On the other hand, users might prefer code retrieval results because of general skepticism towards automatically generated code, as has been reported, e.g., about automatically generated unit tests [33, 103].

Regarding the analysis, we use an interpretable classifier (logistic regression) and follow standard practice for training and testing (cross-validation, held-out test set, etc.), therefore, we do not expect extraordinary threats to validity related to this part of our methodology. However, we do note the typical threats to trustworthiness in qualitative research related to our thematic analysis of top ranking classifier features [88]. To mitigate these, we created a clear audit trail, describing and motivating methodological choices, and publishing the relevant data (queries, top ranking features after classification, etc.). Still, we note potential threats to transferability that may arise if different features or different classifiers are used for training, or a different number/fraction of top ranking features is analyzed qualitatively for themes.

Results. In Table 4, we show the top features that contributed to predicting each one of the two categories, and their corresponding weights. Inspecting the table, we make two observations:

Table 4.
Generation                                        Retrieval
Weight  Feature         Weight  Feature           Weight  Feature         Weight  Feature
0.828   open            0.352   current           0.471   letters         0.294   extract
0.742   time            0.345   delete row        0.442   copy            0.289   set
0.676   sort            0.345   random number     0.438   matplotlib      0.289   plt set
0.590   read csv        0.339   trim              0.437   datetime        0.282   read file
0.556   list            0.330   text file         0.410   python          0.282   cross-validation
0.507   number          0.326   keys              0.365   column csv      0.274   scikit
0.402   search          0.310   round             0.361   bar             0.274   dataframe csv
0.399   open file       0.293   numbers           0.344   copy files      0.274   sklearn
0.385   dictionary      0.291   row dataframe     0.334   delete column   0.272   digit
0.353   read            0.290   load csv          0.302   write file      0.270   folders

Table 4. Most Important 20 Features and Their Weights from the Logistic Regression Modeling Whether Successful Plugin Queries Result in Generated or Retrieved Code Snippets

First, we observe that for code generation, the highest ranked features (most predictive tokens in the input queries) refer mostly to basic Python functionality, e.g., “open, read csv, text file” (opening and reading a file), “sort, list, number, dictionary, keys” (related to basic data structures and operations in Python), “random number” (related to random number generation), “trim” (string operations), and so on. For example, some stereotypical queries containing these tokens that result in the code generation snippets being chosen are “open a csv file data.csv and read the data,” “get date and time in gmt,” “list all text files in the data directory,” and so on.

In contrast, we observe that many queries that are more likely to succeed through code retrieval contain terms related to more complex functionality, some usually requiring a series of steps to fulfill. For example, “datetime” (regarding date and time operations), “cross validation, sklearn, column csv” (regarding machine learning and data analysis), “matplotlib” (data visualization), and so on, are all among the top features for queries where users more often chose the code retrieval snippets.

In summary, it seems predictable (substantially more so than by random chance) whether natural language user queries to our NL2Code plugin are more likely to succeed through code generation vs. code retrieval on average, given the contents (words) of the queries.

6.2 How Well-specified Are the Queries

Search is a notoriously hard problem [47, 69], especially when users do not start knowing exactly what they are looking for, and therefore are not able to formulate clear, well-specified search queries. In this subsection, we investigate the quality of the input natural language queries, and attempt to delineate it from the quality of the underlying code generation and retrieval systems—either one or both may be responsible for failures to obtain desirable code snippets for a given task.

Anecdotally, we have observed that input queries to our NL2Code plugin are not always well-specified, even when the participants selected and inserted into their code one of the candidate snippets returned by the plugin for that query. A recurring issue is that study participants sometimes input only a few keywords as their query (e.g., “move file”), perhaps because they are used to interacting with general-purpose search engines like Google, instead of the more detailed queries expected by our plugin. For example, study participants sometimes omit (despite our detailed instructions) variable names that are part of the intent but defined elsewhere in the program (e.g., “save dataframe to csv” omits the DataFrame variable name). Similarly, they sometimes omit flags and arguments that need to be passed to a particular API method (e.g., “load json from a file” omits the actual JSON filename).

Methodology. The key idea behind our investigation here is to replace the underlying code generation and retrieval systems with an oracle assumed to be perfect—a human expert Python programmer—and study how well the oracle could have produced the corresponding code snippet given a natural language input query. If the oracle could successfully produce a code snippet implementing the intent, then we deem the query “good enough,” or well-specified; otherwise, we deem the query under-specified. The fraction of “good enough” queries to all queries can be considered as an upper bound on the success rate of a perfect code generation model.

Concretely, we randomly sampled 50 queries out of all successful queries issued during the user study (see Table 11 in Appendix I for the sample) and had the first author of this article, a proficient programmer with eight years of Python experience, attempt to generate code based on each of them. The oracle programmer considered two scenarios: (1) generating code given the input query as is, without additional context; (2) if the former attempt failed, then generating code given the input query together with the snapshot of the source file the study participant was working in at the time the query was issued, for additional context.

For each query, we record three binary variables: two indicating whether each of the oracle’s attempts succeeded, without and with additional context, respectively,31 and the third indicating whether the code snippet actually chosen by the study participant for that query came from the code generation model or the code retrieval one; see Table 11 in Appendix I. 32

We then measure the correlation, across the 50 queries, between each of the two oracle success variables and the code snippet source variable, using the phi coefficient \( \phi \) [23], a standard measure of association for two binary variables similar to the Pearson correlation coefficient in its interpretation. This way, we can assess how close the code generation model is from a human oracle (the good enough as is scenario) and whether contextual information from the source code the developer is currently working on might be worth incorporating into code generation models in the future (the good enough with context scenario); note that the code generation model we used in this study [117, 124] does not consider such contextual information.
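For reference, the phi coefficient can be computed directly from a 2×2 contingency table; the following minimal sketch reproduces the “good enough as is” association reported below, using the counts from Table 5:

    import numpy as np

    def phi_coefficient(x, y):
        """Phi coefficient for two binary (0/1) variables, computed from the 2x2 contingency table."""
        x, y = np.asarray(x), np.asarray(y)
        n11 = np.sum((x == 1) & (y == 1))
        n10 = np.sum((x == 1) & (y == 0))
        n01 = np.sum((x == 0) & (y == 1))
        n00 = np.sum((x == 0) & (y == 0))
        denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
        return (n11 * n00 - n10 * n01) / denom if denom else 0.0

    # Counts from Table 5: snippet source (generation = 1) vs. "good enough as is" (yes = 1).
    generation  = [1] * 12 + [1] * 7 + [0] * 8 + [0] * 23
    good_enough = [1] * 12 + [0] * 7 + [1] * 8 + [0] * 23
    print(round(phi_coefficient(generation, good_enough), 2))  # 0.37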

Threats to Validity. We follow standard practice for the statistical analysis in this section; therefore, we do not anticipate notable threats to statistical conclusion validity. Due to the limitations of our telemetry system, we did not record unsuccessful queries (i.e., queries that the user entered but for which no candidate was selected). As a result, queries that favor neither generation nor retrieval cannot be compared. However, we acknowledge three other notable threats to validity. First, we used only one expert programmer as oracle, which may introduce a threat to construct validity given the level of subjectivity in determining which queries are “good enough.” To mitigate this, we discussed among the research team, whenever applicable, queries for which the expert programmer was not highly confident in the determination. Second, our random sample of 50 queries manually reviewed by the expert programmer is only representative of the population of 397 queries with 95% confidence and 13% margin of error, which may introduce a threat to internal validity. However, the relatively small sample size was necessary for practical reasons, given the high level of manual effort involved in the review. Finally, we note a potential threat to construct validity around the binary variable capturing the source (generation or retrieval) of the candidate code snippets selected by the study participants. There is an implicit assumption here that study participants know what the right answer (code snippet) should be given a natural language query and are able to recognize it among the candidates provided by the NL2Code plugin; therefore, we assume that the snippet source variable captures actual quality differences between code snippets produced by the generation and retrieval models, respectively. However, this may not be the case. To test this, we reviewed all the candidate snippets returned by the plugin for the first 6 among the 50 queries analyzed. Across the \( 6 \cdot 2 \text{ models (generation/retrieval)} \cdot 7 \text{ candidates per model} = 84 \text{ candidate snippets} \), we discovered only one case where the study participant could have arguably chosen a more relevant snippet. Therefore, we expect violations of this assumption to be rare enough to not materially affect our results.
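For reference, the reported margin of error follows from the standard finite-population formula, assuming a 95% confidence level (\( z = 1.96 \)) and a worst-case proportion \( p = 0.5 \):

\( \mathit{MOE} = z \sqrt{\frac{p(1-p)}{n}} \sqrt{\frac{N-n}{N-1}} = 1.96 \sqrt{\frac{0.25}{50}} \sqrt{\frac{397-50}{396}} \approx 0.13. \)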

Results. Table 5 shows contingency tables for each of the two oracle comparison scenarios. Note that the “good enough with context” category includes all queries that are “good enough as is,” by construction. Inspecting the results in the table, we make the following observations:

                           Query
Snippet Generation     Good enough as is     Good enough w/ context
                       False      True       False      True
False                  23         8          15         16
True                   7          12         1          18

  • See Table 10 in Appendix H for the actual queries.

Table 5. Contingency Tables for the Two Oracle Comparison Scenarios in Section 6.2

First, the natural language queries analyzed are more often than not insufficiently well-specified for even the human expert to be able to write code implementing those intents; only 20 out of 50 queries (40%) are deemed “good enough as is” by the oracle. Representative examples of failures from Table 11 are the queries consisting of a few keywords (e.g., “csv writer,” “defaultdict”) rather than queries containing sufficient details about the user’s intent (e.g., “remove first column from csv file”). Considering the source file the user was editing at query time helps, with 34 (68%) of the queries now being deemed “good enough with context” by the oracle.

Second, there is a moderately high and statistically significant association between the success of the code generation model (i.e., the study participant choosing one of its candidate code snippets) and the quality of queries in both scenarios: \( \phi = 0.37 \) (\( p = 0.008 \)) for already well-specified queries and \( \phi = 0.45 \) (\( p = 0.001 \)) for queries that become informative enough given additional context. This suggests that input query quality can have a large impact on the performance of the code generation model, and that incorporating additional contextual information may help.

Analyzing the failure rate of the code generation model (generation = False), we observe that it is relatively high in general (31 out of 50 queries, or 62%). However, most of these cases are in response to under-specified queries (23 out of the 31 failures; 74%), for which even the human oracle failed to generate the corresponding code. Still, there are 8 (26%) failure cases where the human expert could directly implement the natural language intent without additional context: “date now,” “for loop on range 100,” “generate random letters,” “get now one week from now,” “get time and date,” “open “data.csv” file,” “how to remove an item from a list using the index,” and “plt create 3 subplots.” All but the last one seem to refer to basic Python functionality. These queries are targets where better code generation techniques could increase the utility of the plugin.

Interestingly, we also observe a non-trivial number of under-specified queries (7 out of 30; 23%) for which the code generation model succeeded despite the human oracle failing: “call pick_with_replacement,” “copy a file to dist,” “pandas round value,” “pandas to csv,” “rename column pandas,” “plt ax legend,” and “scatter.”

6.3 How Much the Code Snippets Are Edited after Plugin Use

Choosing (and inserting into the IDE source file) one of the candidate code snippets returned by the NL2Code plugin indicates that the code snippet was generally useful. However, while useful, the code snippet may still be far from an ideal solution to the user’s query. To get a sense of how appropriate the accepted code snippets are given the user intent, we compare the distributions of snippet lengths before (i.e., as returned by the plugin) and after potential edits in the IDE.

Methodology. When inserting a code snippet a user selected from among the plugin-returned candidates, we also insert special code comments in the source file around the snippet to mark the start and end of the code fragment corresponding to that particular intent (as shown in Figure 3). Study participants are instructed to use a certain key combination when they are done editing that code fragment to remove the delimiters and submit the edited version of the code fragment back to our server. Our analysis in this section compares the length of code snippets and types of tokens present between these two versions.

Specifically, we first tokenize and tag each version of a code snippet using a Python tokenizer and then compare the pairs of distributions of lengths before and after edits for code snippets originating from each of the two underlying models, generation and retrieval, using the non-parametric Wilcoxon signed-rank test; in addition, as a measure of effect size, we compute the median difference between members of the two groups, i.e., the Hodges–Lehmann estimator [46]. We also compute and report on the Levenshtein edit distance between the two versions, in terms of number of tokens. Figure 6 visualizes these different distributions.
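These statistics are standard; the following minimal sketch shows how they can be computed with SciPy and NumPy (the paired length arrays and token lists are placeholders, and the edit distance is a straightforward dynamic-programming implementation over token sequences, not our exact code):

    import numpy as np
    from scipy.stats import wilcoxon

    # Placeholder paired data: token length of each chosen snippet as returned by the plugin
    # and after the participant's edits (one pair per successful query).
    before = np.array([12, 30, 45, 18, 60, 22])
    after = np.array([15, 41, 70, 19, 85, 35])

    stat, p = wilcoxon(before, after)  # paired, non-parametric comparison of the two distributions

    # Hodges-Lehmann-style effect size as described above: median of all pairwise differences.
    hl_estimate = np.median(np.subtract.outer(after, before))

    # Token-level Levenshtein edit distance between the plugin-returned and edited snippets.
    def levenshtein(a, b):
        dp = list(range(len(b) + 1))
        for i, ta in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, tb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
        return dp[-1]

    print(levenshtein(["df", ".", "head", "(", ")"], ["df", ".", "head", "(", "10", ")"]))  # 1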

Fig. 6. Split violin plots comparing the length (in tokens) of the code snippets chosen by the study participants across all successful queries, before and after potential edits in the IDE. The horizontal dotted lines represent 25% and 75% quartiles, and the dashed lines represent medians.

Threats to Validity. We note two potential threats to construct and external validity related to the analysis in this section. First, we have no way of enforcing that study participants confine their code edits related to a particular intent to the section of the source file specially delimited by code comments for this purpose. One may include unrelated edits in the same code region or make related edits outside of the designated region. Therefore, our measurement of snippet length post edits may not accurately reflect the construct of snippet length as related to a particular intent. To mitigate this, we gave clear instructions to participants at the beginning of the study and manually reviewed a small sample of the edited snippets, without discovering any obvious noise. Second, not all study participants followed our instructions every time they used the plugin and submitted their final (edited or not) version of the snippet back to our server. Only 303 out of the 397 successful queries recorded (76.3%) had final code snippets uploaded back to our server. Since this was not a random sample, our findings on this sample may not generalize to the entire population of 397 successful queries. To assess the severity of this potential threat, we compared the distributions of plugin-returned code snippet lengths between all successful queries and just the 303 queries where study participants uploaded their edits onto our server; for both generated (Wilcoxon \( p = 0.54 \)) and retrieved (\( p = 0.93 \)) code snippets, we found the respective two distributions statistically indistinguishable; therefore, we expect this to not be a sizable threat to validity.

Results. Comparing the two distributions of token lengths for accepted code snippets from the code generation model before and after edits, we do not find any statistically significant differences in their mean ranks (\( p = 0.345 \)). The mean edit distance between the two versions of these snippets is 5.2 tokens (min 0, max 130, median 1).

In contrast, comparing the two distributions of token lengths for accepted code snippets from the code retrieval engine before and after edits, we find a statistically significant difference in their mean ranks (\( p = 1.195 \times 10^{-07} \)). The Hodges–Lehmann median difference between the edited and unedited versions of these snippets is 18 tokens, with a 95% confidence interval from 11 to 23 tokens. The edit distance metric paints a similar picture—accepted code snippets from the code retrieval engine, before and after edits, are at a mean edit distance of 13.2 tokens from each other (min 0, max 182, median 0).

We also note that code retrieval snippets tend to be longer than code generation ones both before (\( p \lt 2.2 \times 10^{-16} \); median difference 18 tokens, with a 95% confidence interval from 14 to Infinity) and after edits (\( p = 2.657 \times 10^{-14} \); median difference 10 tokens, with a 95% confidence interval from 7 to Infinity). This may help explain why the retrieved snippets require more edits than the generated ones to better fit the current programming context.

Diving deeper into the edits to the plugin-supplied version of the different snippets, we compute the frequency distribution of tokens in both versions (plugin and final), normalized based on total token count in each corpus. Table 6 highlights the tokens with the greatest increases and decreases in relative frequency during editing. We observe that study participants seem to add common keywords such as “in, for, if, with,” built-in names and functions such as “key, print,” and common variable names such as “line, filename” to the generated/retrieved candidates. Stated differently, in these cases the code snippets seem to miss substantive parts and relevant functionality, which also may be partly due to the lack of specificity described in the previous section.

Addition                                            Deletion
ΔFreq.   Token          ΔFreq.   Token              ΔFreq.    Token           ΔFreq.    Token
0.0040   in             0.0016   w                  –0.0071   2               –0.0016   In
0.0037   for            0.0015   with               –0.0071   1               –0.0016   11
0.0030   line           0.0015                      –0.0043   a               –0.0015   y
0.0024   file           0.0015   days               –0.0038   0               –0.0014   Seattle
0.0023   key            0.0015   cur_v              –0.0034   3               –0.0014   12
0.0023   os.path.join   0.0015   company_info       –0.0025   plt             –0.0013   4
0.0021   dic            0.0015   n                  –0.0023   50              –0.0013   iris
0.0021   filename       0.0014   output             –0.0021   id_generator    –0.0013   string.digits
0.0018   print          0.0014   codecs.open        –0.0018   Out             –0.0013   10
0.0017   if             0.0014   v                  –0.0017   df              –0.0013   matplotlib.pyplot

Table 6. Most Frequently Added/Deleted Tokens after User Edits to Plugin-returned Code Snippets

In contrast, study participants seem to delete number and string literals from the code snippets. This may be explained by the fact that the plugin presents retrieved code snippets as they appear on Stack Overflow, and thus many retrieved code snippets contain additional boilerplate code required for initialization or setup, as well as hard-coded example inputs and outputs. We also observe some commonly used variable names like “df, plt” that get deleted, suggesting that variable replacement is one of the common operations when reusing code snippets. An interesting observation here is that “In” and “Out” are deleted frequently. This is mostly due to some of the code snippets retrieved from Stack Overflow being in the format of an IPython REPL session, which uses “In” and “Out” to separate the Python source code from execution outputs. When integrating these snippets, users have to remove this superfluous text. Figure 7 shows a representative example of such user edits after selecting a candidate snippet, which involves deleting IPython REPL contents, variable replacement and addition, as well as literal replacements.

Fig. 7. Representative example of user edits to a code snippet retrieved from Stack Overflow.

Furthermore, following the previous observations on actual tokens, we are interested in how the frequency of different types of tokens changes before and after users edit the plugin-returned code snippets. We use the tokenize33 Python 3 library to parse and tag the code snippets and compare the frequency changes by token type, similar to the previous analysis. 34 The results are shown in Table 7. We find that users add new NAME (identifiers, keywords) tokens the most, with the frequency of STRING (string literal) tokens slightly increased, and COMMENT (comment strings) tokens staying roughly the same after the edits. NUMBER (numeric literal) tokens are deleted the most, in line with the observation above, again suggesting that many plugin-returned snippets are not tailored to specific identifiers and parameters that the user desires. Interestingly, we also see a slight decrease in frequency of NEWLINE tokens, representing a decrease in the number of logical lines of Python code after edits. This suggests that the plugin-returned code snippets are not concise enough in some cases.

ΔFreq.   Type      ΔFreq.   Type       ΔFreq.    Type        ΔFreq.    Type
0.0138   NAME      0.0053   DEDENT     0.0004    COMMENT     –0.0095   OP
0.0053   INDENT    0.0022   STRING     –0.0049   NEWLINE     –0.0248   NUMBER

  • Sorted in descending order; positive numbers represent addition and negative numbers represent deletion.

Table 7. Frequency Changes of Different Token Types after User Edits to Plugin-returned Code Snippets
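For illustration, the token-type analysis described above can be approximated with Python's standard tokenize module as in the following minimal sketch (the two snippet strings are placeholders, not actual study data):

    import io
    import tokenize
    from collections import Counter

    def token_type_counts(code):
        """Count Python token types (NAME, NUMBER, STRING, OP, ...) in a code snippet."""
        counts = Counter()
        try:
            for tok in tokenize.generate_tokens(io.StringIO(code).readline):
                counts[tokenize.tok_name[tok.type]] += 1
        except (tokenize.TokenError, IndentationError):
            pass  # some retrieved snippets (e.g., IPython REPL fragments) are not fully parseable
        return counts

    plugin_version = "df.to_csv('out.csv')"
    edited_version = "df.to_csv(filename, index=False)"

    before, after = token_type_counts(plugin_version), token_type_counts(edited_version)
    total_before, total_after = sum(before.values()), sum(after.values())
    # Change in relative frequency per token type, analogous to Table 7.
    delta = {t: after[t] / total_after - before[t] / total_before for t in set(before) | set(after)}
    print(sorted(delta.items(), key=lambda kv: kv[1], reverse=True))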


7 RQ\( _{{\bf 3}} \): User Perceptions of the NL2Code Plugin

Our last research question gauges how study participants perceived working with the NL2Code plugin, their pain points, and their suggestions for improvement.

Methodology. As part of our post-test survey, we asked the participants open-ended questions about what worked well when using the plugin and, separately, what they think should be improved. In addition, we asked participants to rate their overall experience using the plugin on a Likert scale, ranging from 1 (very bad) to 5 (very good). We then qualitatively coded the answers to the open-ended questions to identify themes in the responses of the 31 participants who completed all their assigned tasks.

Threats to Validity. We acknowledge usual threats to trustworthiness and transferability from qualitatively analyzing a relatively small set of open-ended survey data [88], as also discussed above. In particular, we note that only one researcher was involved in coding. To mitigate these threats, we release all verbatim survey responses as part of our replication package.

Results. Overall, study participants report having a neutral (15/31; 48.4%) or at least somewhat positive (15/31; 48.4%) experience using the NL2Code plugin, with only one participant rating their experience as somewhat negative.

Among the aspects the participants report as positive, we distill two main themes:

The plugin helps find code snippets the developer is aware of but cannot fully remember. (P1, P2, P8, P10, P11, P19, P20, P21, P22, P30, P31) These tend to be small commands or less familiar API calls and API usage patterns that users have seen before. Two participants summarize this well:

  • “On a few occasions, the plugin very conveniently gave me the snippet of code I was looking for, [which] was “on the tip of my tongue.” (P10)

  • “Sometimes I just cannot remember the exact code, but I remember the shape. I could select the correct one easily.” (P2)

Respondents expressed appreciation for both the generation and retrieval results, and there was little expression of preference for one method over the other, e.g.:

  • “Even just having the snippets mined from Stack Overflow visible in the IDE was a good memory refresher / source of ideas.” (P10)

  • “It was somewhat convenient to not have to switch tabs to Google things, ..., based on my memory, that most of the suggestions I got were from the internet anyway.” (P5)

  • “It has all resources needed at one place.” (P6)

Using an in-IDE plugin is less disruptive than using a web browser. (P1, P4, P5, P6, P7, P10, P18, P20, P24, P27) Many of our respondents who were positive about the plugin reiterate expected context-switching benefits of not leaving the IDE while programming, e.g.:

  • “I like that the plugin stops me having to go and search online for solutions. [...] It can be very easy to get distracted when searching for solutions online.” (P20)

  • “Compared with manual search, this is faster and less disruptive.” (P1)

Participants also describe many aspects of the plugin that could be improved.

The quality of code generation and retrieval results could be higher. (P3, P4, P5, P7, P9, P13, P14, P23, P27, P29, P31) Respondents mentioned that it was “rare” (P7) when they could directly use code from the plugin without modifications. In some cases, results from the plugin were “not related to the search” (P14), and users “didn’t find what [they were] searching for” (P31). As one respondent humbly summarized it:

  • “The model needs some improvements.” (P4)

The insufficient quality of the plugin’s results was especially felt as the tasks became more complex and involved APIs with complex usage patterns. One participant summarized this well:

  • “For easy tasks, like walking through a directory in the filesystem, the plugin saves me time because what I did previously was to go to Stack Overflow and copy the code. But for difficult tasks like data processing or ML, the plugin is not helpful. Most snippets are not useful and I have to go to the website of sklearn to read the full doc to understand what I should do.” (P3)

A particular related pain point is that the snippets from the code retrieval engine often contain spurious elements (as also noted above). In one participant’s words:

  • “When inserting the code into my program, I would like to **not** copy the input/output examples, and I can’t imagine ever wanting those in the program itself.” (P5)

Users could benefit from additional context. (P3, P5, P8, P18, P19, P20, P24, P26, P27) Some respondents mention that it would be useful to include additional (links to) explanations and documentation alongside the returned code snippets so the user could understand what the snippet is supposed to do, or even “which of the suggestions is the correct one when you are not familiar with a module” (P11). In two participants’ words:

  • “It would be nice if the examples from the internet could contain the relevant context of the discussion (e.g., things to consider when using this suggestion), as well as the input/output examples.” (P5)

  • “I hope the generated code snippet can have more comments or usage [examples]. Otherwise I still need to search the web to understand what it is.” (P3)

A closely related theme is that using the plugin assumes one has a “good background understanding of the underlying principles/modules/frameworks” (P11), and they primarily need help with “look[ing] up little syntax bits that you have forgotten” (P11). (P1, P11, P16, P25) One participant was especially critical:

  • “For more complex problems, I think the plugin does not help at all, because the programmer needs to know the theoretical background.” (P16)

The plugin could benefit from additional context. (P4, P9, P10, P17, P30) Some participants suggest that the plugin could be “smarter” if it becomes more aware of the local context in the developer’s IDE, e.g.:

  • “Sometimes I want to generate an expression to be inserted somewhere, to be assigned to a variable, or to match the indentation level, without having to tell the plugin this explicitly. I didn’t feel like the plugin was aware of context.” (P10)

Participants also comment on how the plugin’s query syntax takes some getting used to (P2, P12, P15), referring in particular to the way the code generation model expects queries to include variables, while the web search code retrieval engine allows users to only use keywords. For example:

  • “[It became] useful to me towards the end when I got the hang of it and could formulate questions in the correct way (which I feel is somewhat of a skill in itself).” (P15)

  • “It is not very natural for me to ‘instantiate’ my questions, I mostly like to search [using] keywords or just a description of what I want to achieve.” (P2)

Querying the plugin could be interactive. (P11, P20, P30) Finally, some participants suggest making querying interactive and dialogue-based, rather than unidirectional. This could help with refining queries until they are sufficiently well-specified, or with decomposing complex functionality into smaller steps, e.g.:

  • “A chatbot [...] could identify the rough area in which the user needs assistance, [and] could help narrow it down further, helping to pinpoint an exact solution.” (P20)


8 DISCUSSION AND IMPLICATIONS

Recent years have seen much progress from machine learning and software engineering researchers developing techniques to better assist programmers in their coding tasks, which exploit the advancements in (deep) learning technology and the availability of very large amounts of data from Big Code repositories such as GitHub and Stack Overflow. A particularly promising research direction in this space has been the one addressing the decades-old problem of “natural language programming” [26], i.e., having people instruct machines in the same (natural) language they communicate in with each other, which can be useful in many scenarios, as discussed in the Introduction. However, while excited about this research direction and actively contributing to it ourselves, we also question whether the most impact from such work can be had by focusing primarily on making technological advancements (e.g., as we write this, a one-trillion-parameter language model has just been announced [28], only the most recent development in a very rapidly evolving field) without also carefully considering how such proposed solutions can fit within the software development workflow, through human-centered research.

In this spirit, we have presented the results of a controlled experiment with 31 participants with diverse background and programming expertise, observed while completing a range of Python programming tasks with and without the help of a NL2Code IDE plugin. The plugin allows users to enter descriptions of intent in natural language, and have corresponding code snippets, ideally implementing said intent, automatically returned. We designed the plugin with two research goals in mind. First, we sought to evaluate, to our knowledge for the first time using a human-centered approach, the performance of some NL2Code generation model with state-of-the-art performance on a benchmark dataset, but unknown performance “in the wild.” Second, we sought to contrast the performance and user experience interacting with such a relatively sophisticated model to those of a relatively basic NL2Code retrieval engine, which “merely” retrieves existing code snippets from Stack Overflow given natural language search queries. This way, we could estimate not only how far we are from not having to write any code while programming, but also how far we have come on this problem given the many recent advancements in learning and availability of datasets.

Main Results. Overall, our results are mixed. First, after careful statistical analysis in RQ\( _{{\bf 1}} \), comparing tasks completed with and without using the NL2Code plugin (and either of its underlying code generation or retrieval systems), we found no statistically significant differences in task completion times or task correctness scores.

The results for code metrics (SLOC and CC) can be seen as mixed. On the one hand, the code containing automatically generated or retrieved fragments is not, on average, any more complex or any less maintainable than the code written manually, insofar as the CC and SLOC metrics can distinguish. On the other hand, one could have expected the opposite result, i.e., that since NL2Code tools are typically trained on idiomatic code, using them should lead to “better,” more idiomatic code overall, which might suggest lower SLOC and CC values, on average.

Among the possible explanations for why we do not find supporting evidence for the “better code” hypothesis, two stand out: (i) the two metrics are only crude approximations of the complex, multifaceted concept of code quality; and (ii) even when writing code “manually,” developers still consult the Web and Stack Overflow (i.e., the same resources that these NL2Code tools were trained on) and copy-paste code therein. To better understand the interaction between using the plugin and using a traditional Web browser, we used the event logs from our instrumented environment and compared the distributions of in-browser Web searches between tasks where the 31 study participants used the NL2Code plugin (median 3, mean 5, min 0, max 35 searches per user per task) and tasks where they did not (median 4, mean 7, min 0, max 48). A mixed-effects regression model similar to the ones in Section 5, controlling for individual self-reported experience and with random effects for user and task, reveals a statistically significant effect of using the plugin on the number of in-browser Web searches: On average, using the plugin is associated with 2.8 fewer in-browser Web searches; however, this effect is smaller than the standard deviation of the random user intercept (~4 in-browser Web searches). We conclude that developers still search the Web when using the plugin, even if slightly less than when not using the plugin.

Using a similar argument, the result for task correctness scores can be seen as mixed. Code containing automatically generated or retrieved snippets is not, on average, any less appropriate for a given task as per our rubric than code written manually. However, using the NL2Code plugin does not seem to help our study participants significantly improve their scores either, despite there being room for improvement. Even though across our sample the median score per task was 7 out of 10 when using the plugin and 6 when not using the plugin, the multivariate regression analysis did not find the difference to be statistically significant.

The result for task completion times can be seen as negative and, thus, is perhaps the most surprising of our results: On average, study participants do not complete their tasks statistically significantly faster when using the NL2Code plugin compared to when they are not using it. There are several possible explanations for this negative result. First, we acknowledge fundamental limitations of our study design, which we hope future researchers can improve on. In particular, our tasks, despite their diversity and, we believe, representativeness of real-world Python use, may not lend themselves sufficiently well to NL2Code queries and, therefore, study participants may not have sufficient opportunities to use, and benefit from, the plugin. Moreover, our study population (31 participants) may not be large enough for us to detect effects with small sizes, should they exist.

However, even with these limitations, considering also our results for RQ\( _{{\bf 2}} \) and RQ\( _{{\bf 3}} \), we argue that another explanation is plausible: Our NL2Code plugin and its main underlying code generation technology, despite state-of-the-art (BLEU-score) performance on a benchmark dataset, are not developed enough to be markedly useful in practice just yet. Our telemetry data (RQ\( _{{\bf 2}} \)) shows not only that study participants still carry out in-browser Web searches even though the NL2Code plugin was available, as discussed above, but also that the code snippets returned by the plugin, when used, undergo edits after insertion in the IDE, suggesting insufficient quality to begin with. Our qualitative survey data (RQ\( _{{\bf 3}} \)) paints a similar picture of overall insufficient quality of the NL2Code results.

Implications. While our study suggests that state-of-the-art learning-based natural language to code generation technology is still a long way from being useful in practice, our results should be interpreted more optimistically.

First, we argue that the problem is worth working on. In contemporary software development, which involves countless and constantly changing programming languages and APIs, natural language can be a useful medium to turn ideas into code, even for experienced programmers. A large fraction of our study participants commended NL2Code developer assistants for helping them remember the precise syntax or sequence of API calls and their arguments required to implement some particular piece of functionality. When integrated into the development workflow, e.g., through an IDE plugin, such systems can help developers focus by reducing the need for context switching, further improving their productivity. Our quantitative task performance results for the current version of this NL2Code plugin, while negative, do not imply that future, better-performing systems will not be markedly useful in practice; the qualitative data from our study participants already suggests otherwise, as does quantitative data from prior research on the usefulness of in-IDE code search plugins [92].

Second, we argue that this particular style of code generation is worth working on. Our analysis of input queries and resulting code snippets for RQ\( _{{\bf 2}} \) shows that the code generation model produces fundamentally different results than the (simple) code retrieval engine we used for comparison, and that study participants choose snippets returned by the code generation model almost as frequently as they do snippets from the code retrieval engine. In turn, this suggests that, at least within the scope of the current study, one type of model cannot be used as a substitute for the other. As discussed above, the code generation model almost always produces different results than the code retrieval model. However, it was unclear from that analysis whether the generated code snippets reflect some fundamentally higher level of sophistication inherent to the code generation model, or whether the code retrieval engine we used for comparison is simply too naive.

To further test this, we performed an additional analysis. Specifically, we looked up the chosen code generation snippets in the manually labeled Stack Overflow dataset used for training the code generation model, to assess whether the model is simply memorizing its training inputs. For only 13 out of the 173 unique queries (~7.5%) was the chosen code fragment found verbatim in the model’s training dataset. Therefore, the evidence so far suggests that the code generation model does add some level of sophistication and customization of results to the developers’ intent (e.g., composing function calls), compared to what any code retrieval engine could provide.
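This check amounts to a simple set intersection over whitespace-normalized snippets; a minimal sketch follows (the file names and one-snippet-per-line format are hypothetical, not our actual data layout):

    import re

    def normalize(snippet):
        """Collapse whitespace so that formatting differences do not mask verbatim matches."""
        return re.sub(r"\s+", " ", snippet).strip()

    # Hypothetical inputs: the snippets chosen from the generation model (one per line) and
    # the reference snippets from the model's training data (one per line).
    with open("chosen_generation_snippets.txt") as f:
        chosen = {normalize(line) for line in f if line.strip()}
    with open("training_snippets.txt") as f:
        training = {normalize(line) for line in f if line.strip()}

    verbatim = chosen & training
    print(f"{len(verbatim)} of {len(chosen)} chosen snippets appear verbatim in the training data")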

Third, we provide the following concrete future work recommendations for researchers and toolsmiths in this area, informed by our results:

  • Combine code generation with code retrieval. Our results suggest that some queries may be better answered through code retrieval techniques, and others through code generation. We recommend that future research continue to explore these types of approaches jointly, e.g., using hybrid models [40, 41] that may be able to combine the best of both worlds.

  • Consider the user’s local context as part of the input. Our oracle comparison revealed that users’ natural language queries can often be disambiguated by considering the local context provided by the source files they were working in at the time, which in turn could lead to better performance of the code generation model. There is already convincing evidence from prior work that considering a user’s local context provides unique information about what code they might type next [111]. In addition, some work on code retrieval has also considered how to incorporate context to improve retrieval results [17]; this may be similarly incorporated.

  • Consider the user’s local context as part of the output. Considering where in their local IDE users are when invoking an NL2Code assistant can also help with localizing the returned code snippets for that context. Some transformations are relatively simple, e.g., pretty printing and indentation. Other transformations may require more advanced program analysis but are still well within reach of current technology, e.g., renaming variables used in the returned snippet to match the local context (the Bing Developer Assistant code retrieval engine [115] already does this), or applying coding conventions [2].

  • Provide more context for each returned snippet. Our study shows that NL2Code generation or retrieval systems can be useful when users already know what the right answer is, but they need help retrieving it. At the same time, many of our study participants reported lacking sufficient background knowledge, be it domain-specific or API-specific, to recognize when a plugin-returned code snippet is the right one given their query, or what the snippet does in detail. Future research should consider incorporating more context and documentation together with the plugin’s results, which allows users to better understand the code, e.g., links to Stack Overflow, official documentation pages, explanations of domain-specific concepts, other API usage examples. One example of this is the work of Moreno et al. [78], which retrieves usage examples that show how to use a specific method.

  • Provide a unified and intuitive query syntax. We observed that users are not always formulating queries in the way that we would expect, perhaps because they are used to traditional search engines that are more robust to noisy inputs and designed for keyword-based search. The NL2Code generation model we experimented with in this study was trained on natural language queries that are not only complete English sentences, but also include references to variables or literals involved with an intent, specially delimited by dedicated syntax (grave accents). As our respondents commented in the post-test survey, getting used to formulating queries this way takes some practice. Future research should consider not only what is the most natural way for users to describe their intent using natural language, but also how to provide a unified query syntax for both code generation and code retrieval, to minimize confusion. Robust semantic parsing techniques [8, 95] may also help with interpreting ill-specified user queries.

  • Provide dialogue-based query capability. Dialogue-based querying could allow users to refine their natural language intents until they are sufficiently precise for the underlying models to confidently provide some results. Future systems may reference work on query reformulation in information retrieval, where the user queries are refined to improve retrieval results both for standard information retrieval [7] and code retrieval [39, 45]. In addition, in the NLP community there have been notable advancements recently in interactive semantic parsing [51, 119], i.e., soliciting user input when dealing with missing information or ambiguity while processing the initial natural language query, which could be of use as well.

  • Consider new paradigms of evaluation for code generation and retrieval systems. Usage log data, such as the ones we collected here, is arguably very informative and useful for researchers looking to evaluate NL2Code systems. However, compared to automated metrics such as BLEU, such data is much less readily available. We argue that such data is worth collecting even if only in small quantities. For example, with little but high-quality data, one could still train a reranker [125] to try to select the outputs that a human user selected; if the predictive power exceeds that of BLEU alone, then the trained reranker could be used to automatically evaluate the quality of the generated or retrieved code more realistically than by using BLEU.


9 RELATED WORK

Finally, we discuss in more detail how this work fits into the landscape of related work in the area.

9.1 NL2Code Generation

While we took a particular approach to code generation, there are a wide variety of other options. Researchers have proposed natural language dialogue as a form of human-computer interaction since nearly the advent of modern computers [26, 35, 44, 76]. The bulk of prior work either targeted domain-specific languages (DSLs), or focused on task-specific code generation for general-purpose languages, where more progress could be made given the relatively constrained vocabulary and output code space. Examples include generating formatted input file parsers [63]; structured, idiomatic sequences of API calls [96]; regular expressions [60, 74, 90]; string manipulation DSL programs [100]; card implementations for trading card games [68]; and solutions to the simplest of programming competition-style problems [10].

With the recent boom of neural networks and deep learning in natural language processing, generating arbitrary code in a general-purpose language [123, 124] is becoming more feasible. Some models have been trained on both official API documentation and Stack Overflow questions and answers [117]. There are also similar systems35 able to generate class member functions given natural language descriptions of intent and the programmatic context provided by the rest of the class [49], and to generate the API call sequence in a Jupyter Notebook code cell given the natural language and code history up to that particular cell [1].

9.2 NL2Code Retrieval

Code retrieval has similarly seen a wide variety of approaches. The simplest way to perform retrieval is to start with existing information retrieval models designed for natural language search and adapt them specifically to the source code domain through query reformulation or other methods [39, 45, 52, 71, 113, 115]. Other research works utilize deep learning models [4, 37, 47, 48] to train a relevance model between natural language queries and corresponding code snippets. It is also possible to exploit code annotations to generate additional information that improves code retrieval performance [120], or to extract abstract programming patterns and associated natural language keywords for more content-based code search [52]. Many of these models achieve good performance on human-annotated relevance benchmarks pairing natural language queries with code snippets. In practice, however, many developers simply rely on generic natural language search engines like Google to find appropriate code snippets, first locating pages that contain code snippets through natural language queries [104] on programming Q&A websites like Stack Overflow.

9.3 Evaluation of NL2Code Methods

To evaluate whether NL2Code methods are succeeding, the most common way is to create a “reference” program that indeed implements the desired functionality, and measure the similarity of the generated program to this reference program. Because deciding whether two programs are equivalent is, in the general case, undecidable [101], alternative means are necessary. For code generation in limited domains, this is often done by creating a small number of input-output examples and making sure that the generated program returns the same values as the reference program over these tests [15, 59, 114, 118, 126, 127, 128, 129, 130]. However, when scaling to broader domains, creating a thorough and comprehensive suite of test cases over programs that have a wide variety of assumptions about the input and output data formats is not trivial.

As a result, much research work on code generation and retrieval takes a different tack. Specifically, many code generation methods [1, 49, 117, 123] directly compare generated code snippets against ground truth snippets, using token sequence comparison metrics borrowed from machine translation, such as the BLEU score [89]. However, many code snippets are equivalent in functionality but differ substantially in token sequence, or differ only slightly in token sequence but greatly in functionality, and thus BLEU is an imperfect metric of the correctness of a source code snippet [110].

Code retrieval, in contrast, is the task of retrieving relevant code given a natural language query, and it is related to other information retrieval tasks. Since code retrieval is often used to search for vague concepts and ideas, human-annotated relevance judgments are needed for evaluation. The common approach in research work [37, 47, 121] is to compare the retrieved code snippet candidates for a natural language query with a human-annotated list of code snippet relevance, using standard automatic information retrieval metrics such as NDCG, MRR, and so on [73]. The drawback of this evaluation method is that the cost of relevance annotation is high and often requires experts in the specific area. Also, since the candidate lists are usually long, only a few unique natural language queries can be annotated. For example, one of the most recent large-scale code search challenges, CodeSearchNet [47], contains only 99 unique natural language queries, along with their corresponding expert relevance annotations for candidate code snippets, leading to limited coverage of real-world development scenarios in evaluation.

Regardless of the automatic metrics above, in the end our goal is to help developers in their task of writing code. This article fills that gap by addressing the fundamental question of whether these methods are useful within the developer workflow.

9.4 In-IDE Plugins

Similarly, there has been much work on deploying plugins inside IDEs to help developers. Both Ponzanelli et al. [91] and Ponzanelli et al. [92] focus on reducing context switching in the IDE by using the IDE context to automatically retrieve pertinent discussions from Stack Overflow. Subramanian et al. [109] propose a plugin to enhance traditional API documentation with up-to-date source code examples. Rahman and Roy [97] and Liu et al. [70] design plugins to help developers find solutions on the Internet to program exceptions and errors. Following a similar route, Brandt et al. [16] study opportunistic programming, where programmers leverage online resources with a range of intentions, including assistance that can be accessed from inside the IDE.

Besides plugins developed to reduce context switching to other resources in developer workflows, Amann et al. [5] focus on collecting data on various developer activities from inside the IDE to fuel empirical research in the area [94].

This article proposes an in-IDE plugin that incorporates code generation in addition to code retrieval, to test the user experience in a realistic development workflow. At the same time, it also collects fine-grained data on user activities, both interactions with the plugin and edits to the candidate code snippets, to provide public data for future work.

9.5 End-user Development

The direction of using natural language intents to generate code snippets is closely related to end-user development [67], which allows end-users (people who are not professional software developers) to program computers. The work of Cypher et al. [24] is among the first that enables end-users to program by demonstration.

Traditionally, programming has been performed by software developers who write code directly in programming languages for the majority of functionality they wish to implement. However, acquiring the requisite knowledge to perform this task requires time-consuming training and practice, and even for skilled programmers, writing programs requires a great amount of time and effort. To this end, there have been many recent developments on no-code or low-code software development platforms that allow both programmers and non-programmers to develop software in modalities of interaction other than code [105]. Some examples include visual programming languages such as Scratch [72], which offers a building-block style graphical user interface to implement logic. In specific domains such as user interface design and prototyping, recent advances in deep learning models also enable developers to sketch the user interface visually and then automatically generate user interface code from the sketch [14] or from existing screenshots [87].

Besides visual no-code or low-code programming interfaces, there has also been much progress on program synthesis [12, 29, 31, 108], which uses input-output examples, logic sketches, and so on, to automatically generate functions, with some recent advances that use machine learning models [10, 21, 27, 106]. Some works also generate programs from easier-to-write pseudo-code [59, 129].

There are other works in the area. Barman et al. [11] and Chasins et al. [19, 20] make web automation accessible to non-coders through programming by demonstration, while [64, 65, 66] automate mobile applications with multimodal inputs including demonstration and natural language intents. Head et al. [43] combine teacher expertise with data-driven program synthesis techniques to learn bug-fixing code transformations in classroom scenarios. Head et al. [42] help users extract executable, simplified code from existing code. Ko and Myers [55, 56] provide a debugging interface for asking questions about program behavior. Myers and Stylos [82] argue that API designers should consider usability as a step towards enabling end-user programming. Kery et al. [53] and Kery and Myers [54] enable data scientists to explore data easily through exploratory programming. Our plugin, which uses both state-of-the-art code generation and code retrieval to provide a more natural programming experience to developers, with the potential to eventually enable end-user programming, is related to Myers et al. [81], which envisions natural language programming.

9.6 Code Completion

Many developers use Integrated Development Environments (IDEs) as a convenient solution to help with many aspects of development. Most importantly, many developers actively rely on intelligent code-completion aids like IntelliSense36 for Visual Studio [6, 94] to learn more about the code, keep track of parameters, and add calls to properties and methods with only a few keystrokes. Many of these intelligent code-completion tools also consider the current code context in which the developer is editing. With the recent advances in machine learning and deep learning, tools such as IntelliCode37 for Visual Studio, Codota,38 and TabNine39 provide AI-assisted code suggestion and completion based on the current source code context, learned from large numbers of projects on the Internet. The scope of our article is to investigate generating or retrieving code using natural language queries, rather than based on the context of the current source code.


10 CONCLUSION

In this article, we performed an extensive user study of in-IDE code generation and retrieval, developing an experimental harness and framework for analysis. This demonstrated challenges and limitations in the current state of both code generation and code retrieval; results were mixed with regard to the impact on the developer workflow, including time efficiency, code correctness, and code quality. However, there was also promise: Developers subjectively enjoyed the experience of using in-IDE developer assistance tools and provided several concrete areas for improvement. We believe that these results will spur future, targeted development in productive directions for code generation and retrieval models.

APPENDICES

A USER STUDY ENVIRONMENT DESIGN

To control the user study’s development environment across different users as much as possible, and to enable data collection and activity recording outside the IDE (e.g., web browsing activity during development), we design a complete virtual machine-based environment that users access remotely to perform the user study. We build the virtual machine on open-source software, including the Ubuntu 18.04 operating system40 with the XFCE 4.1 desktop environment.41 The virtual machine software is VirtualBox 6.1.10,42 and we use Vagrant43 for automatic virtual machine provisioning.

Inside the Linux virtual machine, we install and configure a set of programs for data collection and workflow control during the user study:

(1) Python environment. Python 3.644 is installed inside the VM, alongside the pip package manager and several commonly used Python packages for the user study tasks. Users are free to install any additional packages they need during development.

(2) IDE with plugin. PyCharm Community Edition 2020.1 is installed, with the plugin described in Section 3. This provides a consistent Python development environment for the user study and for testing the code generation and retrieval. The plugin also handles various data collection processes inside the IDE.

(3) Man-in-the-middle proxy. We install mitmproxy45 in the VM, along with a customized script that sends logs back to our server (a minimal addon sketch is shown after this list). This infrastructure enables interception and data collection of both HTTP and secured HTTPS requests. With this, we can collect users’ complete web browsing activities during the user study.

(4) Web browser. We install the Firefox browser,46 configured to use the proxy mentioned above so that all users’ browsing activities can be logged for analysis.

(5) Keylogger. We develop a program that runs in the background during the user study and logs all the user’s keystrokes, along with timestamps, to our server. With the keylogger, we can collect data about the users’ activities outside the IDE. This data is useful for mining and analyzing developer activity patterns in terms of keyboard operations, for example, copy-and-paste shortcuts.

(6) User study control scripts. We provide users a handful of scripts for easy and fully automatic retrieval, start, and submission of the tasks. The scripts allow users to check their completion status for the whole study, as well as to pause and resume during a task for a break. All of the user’s task start, pause, resume, and submission events are logged so that the completion time of each task can be calculated.
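The customized logging script mentioned in item (3) can be approximated with a small mitmproxy addon; the following is a minimal sketch (the log endpoint and event fields are placeholders, not our actual logging protocol):

    # logging_addon.py -- run with: mitmdump -s logging_addon.py
    import json
    import time
    import urllib.request

    from mitmproxy import http

    LOG_ENDPOINT = "https://example.org/collect"  # placeholder for the study's log server

    class BrowsingLogger:
        def request(self, flow: http.HTTPFlow) -> None:
            # Record the URL and timestamp of every intercepted request (HTTP and HTTPS).
            event = {"ts": time.time(), "method": flow.request.method, "url": flow.request.pretty_url}
            data = json.dumps(event).encode("utf-8")
            req = urllib.request.Request(LOG_ENDPOINT, data=data, headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=2)
            except OSError:
                pass  # never block the user's browsing if the log server is unreachable

    addons = [BrowsingLogger()]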

B PRE-TEST SURVEY DETAILS

We asked each prospective participant for two types of information in a pre-study survey, apart from personal information for contact purposes. The first concerns programming experience, used to determine whether participants have enough expertise in Python as well as in the categories of tasks that we designed. The questions are:

(1) Which of the following best describes your current career status: Student (computer science), Student (other field), Software Engineer, Data Scientist, Researcher, Other.

(2) How do you estimate your programming experience? (1: very inexperienced to 5: very experienced)

(3) How experienced are you with Python? (1: very inexperienced to 5: very experienced)

(4) How experienced are you with each of the following tasks in Python? (1: very inexperienced to 5: very experienced) Basic Python, File, OS, Web Scraping, Web Server & Client, Data Analysis & Machine Learning, Data Visualization.

The second part concerns development preferences, i.e., participants’ preferred IDEs and assistive tools. The questions are:

(1) What editor/IDE do you use for Python projects? Vim, Emacs, VSCode, PyCharm, Jupyter Notebook, Sublime Text, other.

(2) Do you use any assistive tools or plugins to improve your coding efficiency? Some examples are code linting, type checking, snippet search tools, etc. If yes, what are they?

C PARTICIPANTS’ PROGRAMMING EXPERIENCE

The participants’ detailed programming experience, as reported in the survey, is shown in Figure 8.

Fig. 8. The experience and expertise in overall Python programming and in the 7 specific areas for which we designed tasks, from all participants who completed the survey. 1 represents very inexperienced and 5 represents very experienced.

D POST-STUDY SURVEY DETAILS

After each task, we ask all users (regardless of whether they used the plugin) the following questions about the task design, their self-assessment, and the help they needed during the process:

(1)

How difficult did you feel about the task? (1: very easy to 5: very hard)

(2)

How would you evaluate your performance on the task? (1: very bad to 5: very good)

(3)

How often did you need to look for help during the task, including web search, looking up API references, etc.? (1: not at all to 5: very often)

For users who completed the current task with the plugin enabled, we ask the following additional questions about the plugin user experience:

(1)

How do you think the plugin impacted your efficiency timewise, if at all? (1: hindered significantly, to 3: neither hindered nor helped, to 5: helped significantly)

(2)

How do you think the plugin impacted your quality of life, with respect to ease of coding, concentration, etc., if at all? (1: hindered significantly, to 3: neither hindered nor helped, to 5: helped significantly)

After a user completes all of their assigned tasks, we ask them to fill out a form about their overall experience with the user study and their evaluation of the plugin, and to provide comments and suggestions.

(1)

What did you think of the tasks assigned to you in general?

(2)

Overall, how was your experience using this plugin? (1: very bad to 5: very good)

(3)

What do you think worked well, compared with your previous ways to solve problems during programming?

(4)

What do you think should be improved, compared with your previous ways to solve problems during programming?

(5)

Do you have any other suggestions/comments for the plugin?

E PLUGIN EFFECT ON CODE COMPLEXITY METRICS

We also analyze the plugin’s effect on code complexity metrics, following the same methods used in Section 5. We measure two standard proxies for code complexity of the Python programs produced by our study participants in each of their assigned tasks, i.e., the number of source lines of code (SLOC) and McCabe’s cyclomatic complexity (CC), a measure of the number of linearly independent paths through a program’s source code [75]; in practice, CC depends largely on the number of “if” statements and conditional loops, and on whether these are nested. The two measures tend to be correlated, but not strongly enough to conclude that CC is redundant with SLOC [61]. We use the open-source library Radon47 to calculate CC.
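
Concretely, both metrics can be computed with Radon’s Python API. The sketch below illustrates the measurement for a single submission; the file path is hypothetical, and summing per-block complexity is one possible aggregation rather than necessarily the exact procedure used in our analysis.

    # Minimal sketch: measure SLOC and cyclomatic complexity of one solution file with Radon.
    from radon.complexity import cc_visit
    from radon.raw import analyze

    with open("submission.py") as f:  # hypothetical path to a participant's solution
        source = f.read()

    raw = analyze(source)      # raw metrics; raw.sloc is the number of source lines of code
    blocks = cc_visit(source)  # one entry per function/method/class, each with a .complexity score

    sloc = raw.sloc
    total_cc = sum(block.complexity for block in blocks)  # illustrative aggregation over blocks

    print(f"SLOC = {sloc}, total CC = {total_cc}")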

One could expect that code produced by our NL2Code plugin may be more idiomatic (possibly shorter and less complex) than code written by the participants themselves.

Figure 9 shows the distributions of CC values across tasks and conditions. Figure 10 shows the distributions of SLOC values across tasks and conditions.

Fig. 9. Distributions of cyclomatic complexity values across tasks and conditions. The horizontal dotted lines represent the 25% and 75% quartiles, and the dashed lines represent medians.

Fig. 10. Distributions of SLOC values across tasks and conditions. The horizontal dotted lines represent the 25% and 75% quartiles, and the dashed lines represent medians.

Table 8 summarizes our default specification mixed-effects regressions with CC and SLOC variables included; the models with our second specification (de-meaned task experience) are shown in Appendix G. The models fit the data reasonably well (\( R^2_c = 50\% \) for SLOC, \( R^2_c = 27\% \) for CC).
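
For reference, each of these default-specification models has the general mixed-effects form sketched below, with fixed effects for experience and plugin use and crossed random intercepts for user and task (this is a schematic restatement consistent with Table 8; the precise specification is given in the main article):

\[ y_{ij} = \beta_0 + \beta_1\,\mathrm{Experience}_i + \beta_2\,\mathrm{UsesPlugin}_{ij} + u_i + v_j + \varepsilon_{ij}, \qquad u_i \sim \mathcal{N}(0, \sigma^2_{\mathrm{user}}), \quad v_j \sim \mathcal{N}(0, \sigma^2_{\mathrm{task}}), \]

where \( y_{ij} \) is the response of user \( i \) on task \( j \) (here, SLOC or CC), and the estimated standard deviations of \( u_i \) and \( v_j \) are reported as sd(user) and sd(task) in Table 8.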

Table 8. LMER Task Performance Models (Default Specification, w/ Code Complexity Metrics)

                      Dependent variable
                      Completion time   Correctness score   SLOC        CC
                      (1)               (2)                 (3)         (4)
Experience            -195.62           0.07                -0.62       -0.21
                      (183.11)          (0.24)              (1.61)      (0.46)
Uses plugin           15.76             0.44                4.16**      0.73
                      (196.11)          (0.30)              (1.91)      (0.58)
Constant              3,984.51***       5.88***             27.15***    5.64***
                      (838.07)          (1.03)              (7.40)      (1.95)
Observations          224               237                 237         237
Num users             31                31                  31          31
Num tasks             14                14                  14          14
sd(user)              1,489.25          0.82                6.16        1.18
sd(task)              1,104.7           1.14                12.65       2.33
R2m                   0.004             0.008               0.011       0.006
R2c                   0.642             0.289               0.502       0.27
Akaike Inf. Crit.     3,987.14          1,106.66            2,002.42    1,417.27
Bayesian Inf. Crit.   4,007.61          1,127.46            2,023.23    1,438.08

  • Note: *p < 0.1; **p < 0.05; ***p < 0.01.

Analyzing the models, we make the following observations. There is no statistically significant difference between the two conditions in cyclomatic complexity (model (4)). That is, the code written by users in the plugin condition is statistically indistinguishable, in both correctness and complexity, from the code written by users in the control group.

We note a small effect of using the plugin on code length (model (3)). On average, the code written by users in the plugin condition is ~4 source lines of code longer than the code written by users without using the plugin. However, this effect is quite small, smaller than the standard deviation of the random user intercept (~6 source lines of code).

F NL2CODE PLUGIN QUERY SYNTAX

To obtain the best results from the code generation model, we also instruct users to write queries in the form the model expects, following these rules (illustrative snippets for the example queries are shown after this list):

  • Quote variable names in the query with grave accent marks: ... `variable_name` ...

  • Quote string literals with regular quotation marks: ... “Hello World!” ...

  • Example query 1: open a file “yourfile.txt” in write mode.

  • Example query 2: lowercase a string `text` and remove non-alphanumeric characters aside from space.
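
As an illustration of the intended interaction, queries following these rules would be expected to yield short snippets along the following lines; the snippets are illustrative only, and the actual output of the generation or retrieval back end may differ.

    # Illustrative result for Example query 1: open a file "yourfile.txt" in write mode
    f = open("yourfile.txt", "w")

    # Illustrative result for Example query 2, assuming `text` is an existing string variable:
    # lowercase a string `text` and remove non-alphanumeric characters aside from space
    import re
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())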

G TASK PERFORMANCE MODELS (DE-MEANED SPECIFICATION)

Table 9 summarizes our alternative-specification (de-meaned task experience) mixed-effects regressions for the two response variables from the main article, plus the two response variables (CC and SLOC) introduced in Appendix E.

Table 9. LMER Task Performance Models (De-meaned Experience, w/ Code Complexity Metrics)

                      Dependent variable
                      Completion time   Correctness score   SLOC        CC
                      (1)               (2)                 (3)         (4)
Experience BTW        -478.55           -0.04               -1.47       0.04
                      (566.62)          (0.43)              (2.98)      (0.74)
Experience WI         -166.14           0.12                -0.30       -0.35
                      (191.33)          (0.29)              (1.87)      (0.56)
Uses plugin           14.47             0.44                4.15**      0.74
                      (196.07)          (0.30)              (1.90)      (0.58)
Constant              5,142.42**        6.32***             30.59**     4.62
                      (2,348.61)        (1.77)              (12.60)     (3.07)
Observations          224               237                 237         237
Num users             31                31                  31          31
Num tasks             14                14                  14          14
sd(user)              1,482.32          0.81                6.15        1.17
sd(task)              1,107.9           1.13                12.69       2.32
R2m                   0.012             0.008               0.012       0.007
R2c                   0.643             0.287               0.504       0.269
Akaike Inf. Crit.     3,988.86          1,108.56            2,004.30    1,419.09
Bayesian Inf. Crit.   4,012.74          1,132.84            2,028.58    1,443.36

  • Note: *p < 0.1; **p < 0.05; ***p < 0.01.

Table 10.
TaskQueries
T1-1callpick\( \_ \)with\( \_ \)replacementhow to generate random letter
create a dictionary with keysrandom\( \_ \)letters and valuesrandom\( \_ \)numbersimport library random
create dictionarylist to dict
create empty dictionaryloop on numbers from 0 to 100
create list ”a\( \_ \)list”loop over a range ofcount
defaultdictmerge 2 dictionaries
dictionary of characters and intpair characters incharactersand numbers innumbers
for loop on range 100printdickeys on each line
generat integers 1–20printdickeys sorted
generate 100 integers (1–20 inclusive).printdicsorted by keys
generate 100 random lower-cased letersprint a to z
generate 100 random lowercase lettersprint list
generate 100 random numbersprint list as string
generate 100 random numbers from 1 to 20print list elements
generate a rondom lower case characterprint without newline
generate char lower caserandom
generate dictrandom character between a and z
generate list of random charachtersrandom characters
generate lowercase charrandom integer between 1 and 20
generate randomrandom number
generate random between 0 and 20random sample with replacement
generate random charachterrandomly generate 100 letters
generate random intrandomly pick an item fromseq
generate random lettersrearrange dictionary keys into alphabetic order
generate random lower case letterssort a list
generate random nu,bersort a list into ascending order
generate random numbersort a list x into ascending order
generate random numberssort dict by key
generate random numbers between 1-20 inclusivesort key of dict
get a random lettersort list
given listlettersandintegers, create a dicitonary such that the values inlettersare keys and values inintegersare valuessort list ’values’ into ascending order
how to append value in dictsquence of integers from 1 to 20 inclusive
how to check if a key is in a dictionayzip 2 lists
how to generate random int in range between 1 and 20ziphundred\( \_ \)characterswithhundred\( \_ \)numbers
T1-2add a week to a datetimeget gmt timezone
add days to timeget now one week from now
assign current date and time tonowget the current date in utc
change date formatget the current time in utc
change datetime format ofweek\( \_ \)dateto mm-dd-yyyy hh:mmget the date and time a week from now in gmt
convertweek\( \_ \)dateto GMT timezone and assign toGMT\( \_ \)week\( \_ \)dateget time and date
convert date timezoneget time and date in gmt indate
date from 7 daysget time and date one week from now
date gmtget time now
date nowgmt
datetimegmt time 24
displayweek\( \_ \)datein format mm-dd-yyyy hh:mmimport datetime
format datetimeimport time
format datetime 24 hourmm-dd-yyyy
format timeprint current date time
get current datetimeprint date and time in GMT in 24hr format
get date 7 days from todayprint datetime in mm-dd-yyyy hh:mm format
get date and time in gmttime add
get date and time one week from nowtime and date
get date time one week from nowtime and date in certain
get datetimetimedelta
T2-1copy column from ”data.csv” file to another ”output.csv”new line
copy column from ”data.csv” to ”output.csv”number of columns of csv
create ’output.csv’ csv fileopen ”data.csv” file
csv writeopen a csv filedata.csvand read the data
csv writeropen csv
cvsopen csv filedata.csv
cvs filesopen csv file with read and write
delete a column in csvopen file
delete column from csvpandas read csv
delete column from csv filepandas read csv named ”data.csv”
delete first and last column in csv fileprint csv without row numbers
delete first and last column of dfpython make dir
delete first and last row from the dataframedfread ”data.csv” file
delete first row from dataframedfread csv file ”data.csv”
delete row in csvread csv file using pandas
delete the first column in csv file dfread csv pure python
file to csvread cvs
get current pathremove columns from csv file and save it to another csv file
get specific columns by index in pandas data frameremove first column from csv file
headers in a dataframesave df to a file output.csv in a new directory example\( \_ \)output
how to delete a column in a dataframe pythonsave dataframe to csv
how to delete columns in dataframesave pandas dataframe to a file
how to save a dataframe in csv filesave this dataframe to a csv
if dir existwriteoutputto csv file
if directory ”output” existswrite csvoutput\( \_ \)fto file ”output/output.csv”
make directorywrite output to csv file ”output.csv”
make directory ”output” if it doesn’t existwrite to csv file
T2-2change directorylist files in folder
change directory to ”data”list of filenames from a folder
check file encodingmove file to other directory
check if directory existsnormalize newlines to \( \textbackslash \)n
convert binary decoded string to asciiopen file
convert file encodingopen text file
convert file to utfread a file and iterate over its contents
convert latin-1 to utf-8read all files under a folder
convert str to utf-8read file
convert text file encodingread ISO-8859-15
convert text files from encoding ISO-8859-15 to encoding UTF-8.readline encoding
copy a fileredirect
copy fileremove header
copy fileddd.pngremove heading white space
copy file to other foldertext normalize newlines to \( \textbackslash \)n
covert file to utftraverse a directory
find charactertravverse list of files
get all files in directorytrim heading whitespace
get the file extensiontrim the heading and trailing whitespaces and blank lines for all text files
iterating files in a folderunkown encoding
list all text files in the data directorywrite to file
list files in directory
T3-1check iffileis a directorymatch regex year month day
check if string has specific patternmove file
copy a file to distmove files from directory to directory
copy all files and directories from one folder to anotherrecursive copy files and folders
copy directory to another directoryrecursively iterate over all files in a directory
copy directory to directoryregex dd-mm-yy
copy directory tree from source to destinationregex digit python
copy file fromsrc\( \_ \)pathtodest\( \_ \)pathregex for date
copy filesregex replace capture group
copy files and directories under data directoryregexp date
copy files creating directoryrename file
copy files from folderrename file with regex
create filerename files
create folderreplace pattern in string
datetime to stringsearch all matches in a string
extract year month day from string regexsearch for pattern ”%d%d-%d%d” infile
get all files and folderswalk all files in a directory
get the files that inside the folderswalk all nested files in the directory ”data”
list all filepaths in a directorywalke all files in a directory
make a folder recersivelywrite to file
T3-2add entry to json fileload json file
check if fileoutput\( \_ \)fileexistsload json from a file
check if file ends with .jsonread a json file namedf
convert dict to stringsorting a dictionary by key
convert list to dictionarywrite into txt file
import json parsing librarywrite json in ret to file outfile
T4-1find all bold text from html soupparse all hyperlinks from r using bs4
find all hrefs from soupvisit url and extract hrefs using bs4
find all red colored text from html soupvisit the given url url and extract all hrefs from there
go to a urlvisit the urlurl
how to get page urls beautifulsoup
T4-2create directoryregex []
download an image requestsave dict to csv
extract imafe from htmlsave table beautifulsoup
http reques get html
T5-1add json file to a listcheck email correctness
T5-2argparse subprogramprint format
exit programrequest with params
gET request to ”https://jsonplaceholder.typicode.com/posts” with argument userId
T6-1a list of dictionary to pandas dataframepandas change dataframe column name
add a new column to a dataframe rowpandas create buckets by column value
average by group pandaspandas dropnan
cast a float to two decimalspandas get average of column
cast a list to a dataframepandas group by
column to integer pandaspandas join dataframes
create a dataframe from a listpandas join series into dataframes
csvpandas output csv
csv writepandas print with two decimals
delete coloumn pdpandas read from csv
df set column to 7 decimalspandas round value
filter df with two conditionspandas save csv two decimal
filter values in pandas dfpandas to csv
find unique data from csvpandas to csv decimal
findallpandas write df to csv
floating data in csv group in digitpandas write to csv file
format output to 2 decimalpandas write to file decimal
get average of row values in pandas dataframeread csv
get average value from group of data in csvread csv file
get the head of dataframe dfremove repeated column in csv file
group by range pandasrename column pandas
group of data from csvrename pandas df columns
how to combine 2 lists into a dictionaryround a variable to 2dp
how to remove an item from a list using the indexsave compan\( \_ \)df dataframe to a file
import pandassave compand\( \_ \)df dataframe to a file
list to an entry in pandas dataframesort dataframejdfbyscores
load csv file with pandassort dataframejdfby the values of column ’scores’
loop files recursivesort pandas dataframe
newline spacestandard deviation from group of data in csv
pandas add new column based on row valuestwo deciaml place
pandas calculate meanwritefinal\( \_ \)datato csv file ”price.csv”
T6-2cross validation in scikit learnmultinomial logistic regression model
cross validation mean accuracynumpy load from csv
disable warningsrun 5-fold accuracy
how to determine cross validation mean in scikit learnset numpy random seed to 0
how to split dataset in scikit learnsklearn 5 fold cross validation
how to split dataset in scikit learnsklearn 5-fold cross validation
linear regressor 5 folder cross validationsklearn cross validation x, y for 5 folds
load wine datasetsklearn ignore warnings
T7-1how to choose plot size in inchesplt set x axis tick range
how to choose plot title in matplotlibplt set xtick font size
how to create ascatter plot using matplotlibreformat date
how to draw scatter plot for data in csv filesave plot as image
plt create figure with sizesave plt figure
plt date as x axisscatter
plt set x axis labelscatter plot purple
T7-2bar graph side by sideplot bar
bar plot with multiple bars per labelplot size
get height of bars in subplot bar gaphsplot title
get labels above bars in subplotsplt ax legend
group pandas df by two columnsplt ax xlabel
horizontal subplotplt create 3 subplots
import matplotlibplt set title for subplot figure
matplotlib grouped bar chartplt set x tick labels
matplotlib multiple histogramsplt show values on bar plot
matplotlib themepyplot subplots
pandas dataframe from csvselect row pandas
pandas dataframe groupby column
  • Queries for which the participant chose a snippet produced by the code generation model are shown in boldface, and in the remainder a retrieved snippet was used.

Table 10. Unique Successful User Queries to the NL2Code Plugin, Per Task, for the 31 Study Participants


Table 11.
Task    Queries
T1-1    call pick_with_replacement \( \circ \); defaultdict; generate lowercase char \( \bullet \circ \); for loop on range 100 \( \bullet \circ \); generate random between 0 and 20 \( \bullet \circ \); generate char lower case; random sample with replacement \( \bullet \circ \); generate random letters \( \bullet \circ \); sort key of dict \( \bullet \circ \); random characters
T1-2    change datetime format of week_date to mm-dd-yyyy hh:mm \( \bullet \circ \); format datetime; convert week_date to GMT timezone and assign to GMT_week_date \( \bullet \circ \); get gmt timezone \( \circ \); print datetime in mm-dd-yyyy hh:mm format \( \bullet \circ \); get now one week from now \( \bullet \circ \); date now \( \bullet \circ \); get time and date \( \bullet \circ \)
T2-1    remove first column from csv file \( \bullet \circ \); how to delete columns in dataframe \( \circ \); csv writer; open "data.csv" file \( \bullet \circ \); how to delete a column in a dataframe python \( \circ \)
T2-2    traverse a directory \( \circ \)
T3-1    copy a file to dist \( \circ \); recursive copy files and folders \( \circ \); match regex year month day; regexp date
T4-2    download an image request; save dict to csv
T5-2    exit program \( \bullet \circ \); argparse subprogram
T6-1    load csv file with pandas \( \bullet \circ \); how to remove an item from a list using the index \( \bullet \circ \); pandas round value \( \circ \); pandas create buckets by column value; pandas to csv; pandas group by; read csv file \( \bullet \circ \); pandas output csv \( \circ \); rename column pandas \( \circ \); pandas to csv decimal \( \circ \); filter df with two conditions; pandas write df to csv
T6-2    load wine dataset
T7-1    plt create figure with size \( \bullet \circ \); scatter \( \circ \)
T7-2    plt ax legend \( \circ \); plt create 3 subplots \( \bullet \circ \); bar plot with multiple bars per label \( \circ \)
  • Queries for which the user chose a snippet from the code generation model are shown in boldface. \( \bullet \) denotes queries “good enough” on their own; \( \circ \) denotes queries good enough given the rest of the source file as context; the former is a strict subset of the latter.

Table 11. Sampled User Queries for the Oracle Analysis


H USER QUERIES

Table 10 lists the unique successful user queries issued to the NL2Code plugin, per task, for the 31 study participants.

I RANDOMLY SAMPLED USER QUERIES FOR THE ORACLE ANALYSIS

Table 11 lists the randomly sampled user queries that were manually annotated for the oracle analysis.

ACKNOWLEDGMENTS

We thank William Qian, who was involved in developing an early version of the plugin. We thank all participants who took part in the user study for their effort in completing the tasks that tested the intelligent programming interface. We give special thanks to Ziyu Yao and NeuLab members Shuyan Zhou, Zecong Hu, and others for early testing of the plugin and the user study, and for their valuable feedback. We also thank the anonymous reviewers for their comments on revising this article.

Footnotes

  1. 1 https://stackoverflow.com/q/82831.

  2. 2 https://stackoverflow.com/q/38987.

  3. 3 https://www.jetbrains.com/pycharm/.

  4. 4 At https://github.com/neulab/tranX-plugin.

  5. 5 https://github.com/neulab/external-knowledge-codegen.

6. 6 We deployed the model on an internal research server and exposed an HTTP API that the plugin can access; queries are fast enough for the plugin to be usable in real time.

  7. 7 https://www.bing.com/.

  8. 8 We chose Bing rather than other alternatives such as Google due to the availability of an easily accessible search API.

  9. 9 https://stackoverflow.com/.

  10. 10 To mitigate concerns that user queries using the specified syntax (command form sentences and including variable names) may adversely affect the retrieval results, after the full study was complete, we modified 59 user-issued queries that were indeed complete sentences with full variable names, converting them into short phrases without variable names and re-ran the retrieval. We then compared the results and manually annotated the number of times the search engine returned a result that we judged was sufficient to understand how to perform the programming task specified by the user’s intent. As a result, the user-written full intent resulted in a sufficient answer 34/59 times, and the simplified intent without variable names returned a sufficient answer 36/59 times, so it appears that including variable names has a marginal to no effect on whether the search engine was able to provide a good top-1 result. We also measured the exact-match overlap between the top-1 results and found it to be 22/59, and overlap between the top-7 result lists was 182/(59*7).

  11. 11 Note the special syntax used to mark explicit variables; see Appendix F for full syntax details.

  12. 12 We note that the main motivation for this ordering is that the generation results tend to be significantly more concise than the retrieval results (Figure 6). If we put the retrieval results first, then it is likely that the users would rarely scroll past the retrieval results and view the generation results due to issues of screen real-estate. It is important to consider that alternative orderings may result in different experimental results, although examining alternate orderings was not feasible within the scope of the current study.

  13. 13 The edit data may also be helpful as training data for improving code generation and retrieval models. We release our data publicly to encourage this direction in future work.

  14. 14 https://www.udacity.com/courses/all.

  15. 15 https://www.codecademy.com/catalog.

  16. 16 https://www.coursera.org/.

  17. 17 Corresponding to the search https://stackoverflow.com/search?tab=votes&q=python%20matplotlib.

  18. 18 https://stackoverflow.com/questions/332289/how-do-you-change-the-size-of-figures-drawn-with-matplotlib.

  19. 19 https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot.

  20. 20 https://stackoverflow.com/questions/9622163/save-plot-to-image-file-instead-of-displaying-it-using-matplotlib.

  21. 21 https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib.

  22. 22 https://www.upwork.com/.

  23. 23 The task identifiers in Table 2 reflect this order.

  24. 24 Despite these instructions, some participants did not use the plugin even when it was available and when instructed. We discovered this while analyzing the data collected from the study and filtered out 8 participants that did not use the plugin at all. They do not count towards the final sample of 31 participants we analyze data from, even though they completed tasks.

  25. 25 Note that 4 of the 31 participants did not complete all 8 of their assigned tasks. We include their data from the tasks they completed and do not consider the tasks they did not finish.

  26. 26 https://www.jetbrains.com/pycharm/download/.

  27. 27 https://github.com/neulab/tranx-study/blob/master/rubrics.md.

  28. 28 We are using the R syntax to specify random effects.

  29. 29 We also experimented with other features, e.g., query length, query format compliance, and so on, but did not notice a significant difference in prediction accuracy.

  30. 30 Note that this only considers exact substring matches. There may be additional instances of functionally equivalent code that is nonetheless not an exact match.

  31. 31 The former implies the latter but not vice versa.

  32. 32 Note that on the surface, when looking at the data in Table 11, the values of the former two binary variables (the oracle’s determination) may not always seem intuitive given the query. For example, the oracle determined the query “pandas to csv” to be not good enough, even with context, while the query “pandas output csv,” seemingly equivalent, was found to be good enough with context. In both cases, the intent appears to be exporting a pandas dataframe (a popular data science Python library) as a csv file. However, in the first example the snapshot of the source file the study participant was working in at the time of the query did not yet include any such dataframe objects; the user appears to have issued the query ahead of setting up the rest of the context. A context-aware code generation model would also not be able to extract any additional information in this case, similarly to the human oracle.

  33. 33 https://docs.python.org/3/library/tokenize.html.

  34. 34 Three of the retrieved snippets cannot be parsed and thus are omitted. See full explanation of different token types at https://www.asmeurer.com/brown-water-python/tokens.html. We also left out some uninteresting token types, such as ENCODING, ENDMARKER, NL.

  35. 35 This is, of course, among the many other use cases for neural network models of code and natural language such as code summarization [48, 121] or embedding models that represent programming languages together with natural languages [30]. Allamanis et al. [3] provide a comprehensive survey of the use cases of machine learning models in this area.

  36. 36 https://docs.microsoft.com/en-us/visualstudio/ide/using-intellisense.

  37. 37 https://visualstudio.microsoft.com/services/intellicode.

  38. 38 https://www.codota.com/.

  39. 39 https://www.tabnine.com/.

  40. 40 https://releases.ubuntu.com/18.04/.

  41. 41 https://www.xfce.org/.

  42. 42 https://www.virtualbox.org/wiki/Downloads.

  43. 43 https://www.vagrantup.com/.

  44. 44 https://www.python.org/.

  45. 45 https://mitmproxy.org/.

  46. 46 https://www.mozilla.org/en-US/firefox/.

  47. 47 https://github.com/rubik/radon.

REFERENCES

  1. [1] Agashe R., Iyer Srini, and Zettlemoyer Luke. 2019. JuICe: A large scale distantly supervised dataset for open domain context-based code generation. In Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP/IJCNLP).Google ScholarGoogle ScholarCross RefCross Ref
  2. [2] Allamanis Miltiadis, Barr Earl T., Bird Christian, and Sutton Charles. 2014. Learning natural coding conventions. In International Symposium on Foundations of Software Engineering (ESEC/FSE). 281293.Google ScholarGoogle Scholar
  3. [3] Allamanis Miltiadis, Barr Earl T., Devanbu Premkumar, and Sutton Charles. 2018. A survey of machine learning for big code and naturalness. ACM Comput. Surv. 51, 4 (2018), 137.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. [4] Allamanis Miltiadis, Tarlow Daniel, Gordon A., and Wei Y.. 2015. Bimodal modelling of source code and natural language. In 32nd International Conference on Machine Learning (ICML).Google ScholarGoogle Scholar
  5. [5] Amann S., Proksch Sebastian, and Nadi S.. 2016. FeedBaG: An interaction tracker for Visual Studio. In International Conference on Program Comprehension (ICPC). 13.Google ScholarGoogle Scholar
  6. [6] Amann Sven, Proksch Sebastian, Nadi Sarah, and Mezini Mira. 2016. A study of visual studio usage in practice. In IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 124134.Google ScholarGoogle ScholarCross RefCross Ref
  7. [7] Arens Yigal, Knoblock Craig A., and Shen Wei-Min. 1996. Query reformulation for dynamic information integration. J. Intell. Inf. Syst. 6, 2–3 (1996), 99130.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. [8] Arthur Philip, Neubig Graham, Sakti Sakriani, Toda Tomoki, and Nakamura Satoshi. 2015. Semantic parsing of ambiguous input through paraphrasing and verification. Trans. Assoc. Comput. Ling. 3 (2015), 571584.Google ScholarGoogle Scholar
  9. [9] Bacchelli Alberto, Ponzanelli Luca, and Lanza Michele. 2012. Harnessing stack overflow for the IDE. In International Workshop on Recommendation Systems for Software Engineering (RSSE). IEEE, 2630.Google ScholarGoogle Scholar
  10. [10] Balog Matej, Gaunt Alexander L., Brockschmidt Marc, Nowozin Sebastian, and Tarlow Daniel. 2017. DeepCoder: Learning to write programs. In 5th International Conference on Learning Representations (ICLR).Google ScholarGoogle Scholar
  11. [11] Barman S., Chasins Sarah E., Bodík Rastislav, and Gulwani Sumit. 2016. Ringer: Web automation by demonstration. In ACM SIGPLAN International Conference on Object-oriented Programming, Systems, Languages, and Applications.Google ScholarGoogle Scholar
  12. [12] Basin D., Deville Y., Flener P., Hamfelt A., and Nilsson Jørgen Fischer. 2004. Synthesis of programs in computational logic. In Program Development in Computational Logic.Google ScholarGoogle ScholarCross RefCross Ref
  13. [13] Bell Andrew, Fairbrother Malcolm, and Jones Kelvyn. 2019. Fixed and random effects models: Making an informed choice. Qual. Quant. 53, 2 (2019), 10511074.Google ScholarGoogle ScholarCross RefCross Ref
  14. [14] Beltramelli Tony. 2018. pix2code: Generating code from a graphical user interface screenshot. In ACM SIGCHI Symposium on Engineering Interactive Computing Systems. ACM, 3:1–3:6. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. [15] Berant Jonathan, Chou Andrew, Frostig Roy, and Liang Percy. 2013. Semantic parsing on freebase from question-answer pairs. In Conference on Empirical Methods in Natural Language Processing (EMNLP). 15331544.Google ScholarGoogle Scholar
  16. [16] Brandt J., Guo P., Lewenstein J., Dontcheva Mira, and Klemmer Scott R.. 2009. Two studies of opportunistic programming: Interleaving web foraging, learning, and writing code. In SIGCHI Conference on Human Factors in Computing Systems (CHI).Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. [17] Campbell Brock Angus and Treude Christoph. 2017. NLP2Code: Code snippet content assist via natural language tasks. In International Conference on Software Maintenance and Evolution (ICSME). IEEE, 628632.Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Cateté Veronica and Barnes T.. 2017. Application of the delphi method in computer science principles rubric creation. InACM Conference on Innovation and Technology in Computer Science Education.Google ScholarGoogle Scholar
  19. [19] Chasins Sarah E., Barman S., Bodík Rastislav, and Gulwani Sumit. 2015. Browser record and replay as a building block for end-user web automation tools. In 24th International Conference on World Wide Web (WWW).Google ScholarGoogle Scholar
  20. [20] Chasins Sarah E., Mueller Maria, and Bodík Rastislav. 2018. Rousillon: Scraping distributed hierarchical web data. In 31st Annual ACM Symposium on User Interface Software and Technology (UIST).Google ScholarGoogle Scholar
  21. [21] Chen X., Liu C., and Song D.. 2019. Execution-guided neural program synthesis. In 7th International Conference on Learning Representations (ICLR).Google ScholarGoogle Scholar
  22. [22] Cohen J.. 2003. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Lawrence Erlbaum.Google ScholarGoogle Scholar
  23. [23] Cramér Harald. 1999. Mathematical Methods of Statistics. Vol. 43. Princeton University Press.Google ScholarGoogle Scholar
  24. [24] Cypher A., Halbert Daniel C., Kurlander D., Lieberman H., Maulsby D., Myers B., and Turransky Alan. 1993. Watch what I do: Programming by demonstration.Google ScholarGoogle Scholar
  25. [25] Dawood M., Buragga Khalid A., Khan Abdul Raouf, and Zaman Noor. 2013. Rubric based assessment plan implementation for computer science program: A practical approach. In IEEE International Conference on Teaching, Assessment and Learning for Engineering (TALE). 551555.Google ScholarGoogle Scholar
  26. [26] Dijkstra Edsger W.. 1979. On the foolishness of “natural language programming.” In Program Construction. Springer, 5153.Google ScholarGoogle Scholar
  27. [27] Ellis K., Nye Maxwell, Pu Y., Sosa Felix, Tenenbaum J., and Solar-Lezama Armando. 2019. Write, execute, assess: Program synthesis with a REPL. In 33rd Conference on Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  28. [28] Fedus William, Zoph Barret, and Shazeer Noam. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961 (2021).Google ScholarGoogle Scholar
  29. [29] Feng Y., Martins R., Bastani Osbert, and Dillig Isil. 2018. Program synthesis using conflict-driven learning. In 39th ACM SIGPLAN Conference on Programming Language Design and Implementation.Google ScholarGoogle Scholar
  30. [30] Feng Zhangyin, Guo Daya, Tang Duyu, Duan N., Feng X., Gong Ming, Shou Linjun, Qin B., Liu Ting, Jiang Daxin, and Zhou M.. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Conference on Empirical Methods in Natural Language Processing (EMNLP).Google ScholarGoogle ScholarCross RefCross Ref
  31. [31] Feser John K., Chaudhuri S., and Dillig Isil. 2015. Synthesizing data structure transformations from input-output examples. In 36th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. [32] Franks Christine, Tu Zhaopeng, Devanbu Premkumar, and Hellendoorn Vincent. 2015. CACHECA: A cache language model based code suggestion tool. In International Conference on Software Engineering (ICSE). IEEE, 705708.Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Fraser Gordon, Staats Matt, McMinn Phil, Arcuri Andrea, and Padberg Frank. 2015. Does automated unit test generation really help software testers? A controlled empirical study. ACM Trans. Softw. Eng. Methodol. 24, 4 (2015), 149.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. [34] Gelman Andrew and Hill Jennifer. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.Google ScholarGoogle ScholarCross RefCross Ref
  35. [35] Ginsparg J.. 1978. Natural language processing in an automatic programming domain.Google ScholarGoogle Scholar
  36. [36] Grover Shuchi, Basu S., and Schank Patricia K.. 2018. What we can learn about student learning from open-ended programming projects in middle school computer science. In 49th ACM Technical Symposium on Computer Science Education.Google ScholarGoogle Scholar
  37. [37] Gu Xiaodong, Zhang Hongyu, and Kim Sunghun. 2018. Deep code search. In IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 933944.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. [38] Gulwani Sumit. 2011. Automating string processing in spreadsheets using input-output examples. ACM SIGPLAN Not. 46, 1 (2011), 317330.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] Haiduc Sonia, Bavota G., Marcus A., Oliveto R., Lucia A., and Menzies T.. 2013. Automatic query reformulations for text retrieval in software engineering. In 35th International Conference on Software Engineering (ICSE). 842851.Google ScholarGoogle Scholar
  40. [40] Hashimoto Tatsunori B., Guu Kelvin, Oren Yonatan, and Liang Percy S.. 2018. A retrieve-and-edit framework for predicting structured outputs. In Conference on Advances in Neural Information Processing Systems (NeurIPS). 1005210062.Google ScholarGoogle Scholar
  41. [41] Hayati Shirley Anugrah, Olivier Raphael, Avvaru Pravalika, Yin Pengcheng, Tomasic Anthony, and Neubig Graham. 2018. Retrieval-based neural code generation. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 925930. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  42. [42] Head Andrew, Glassman Elena Leah, Hartmann B., and Hearst Marti A.. 2018. Interactive extraction of examples from existing code. In CHI Conference on Human Factors in Computing Systems.Google ScholarGoogle Scholar
  43. [43] Head Andrew, Glassman Elena Leah, Soares Gustavo, Suzuki R., Figueredo Lucas, D’Antoni L., and Hartmann B.. 2017. Writing reusable code feedback at scale with mixed-initiative program synthesis. In 4th ACM Conference on Learning @ Scale.Google ScholarGoogle Scholar
  44. [44] Heidorn George E.. 1976. Automatic programming through natural language dialogue: A survey. IBM J. Res. Devel. 20, 4 (1976), 302313.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. [45] Hill E., Roldan-Vega Manuel, Fails J., and Mallet Greg. 2014. NL-based query refinement and contextualized code search results: A user study. In Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). 3443.Google ScholarGoogle ScholarCross RefCross Ref
  46. [46] Jr. Joseph L. Hodges and Lehmann Erich L.. 1963. Estimates of location based on rank tests. Ann. Math. Statist. (1963), 598611.Google ScholarGoogle Scholar
  47. [47] Husain Hamel, Wu Ho-Hsiang, Gazit Tiferet, Allamanis Miltiadis, and Brockschmidt Marc. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).Google ScholarGoogle Scholar
  48. [48] Iyer Srini, Konstas Ioannis, Cheung A., and Zettlemoyer Luke. 2016. Summarizing source code using a neural attention model. In 54th Annual Meeting of the Association for Computational Linguistics (ACL).Google ScholarGoogle Scholar
  49. [49] Iyer Srini, Konstas Ioannis, Cheung A., and Zettlemoyer Luke. 2018. Mapping language to code in programmatic context. In Conference on Empirical Methods in Natural Language Processing (EMNLP).Google ScholarGoogle ScholarCross RefCross Ref
  50. [50] Johnson Paul C. D.. 2014. Extension of Nakagawa & Schielzeth’s \( R^2_{GLMM} \) to random slopes models. Meth. Ecol. Evolut. 5, 9 (2014), 944946.Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Karamcheti Siddharth, Sadigh Dorsa, and Liang Percy. 2020. Learning adaptive language interfaces through decomposition. In 1st Workshop on Interactive and Executable Semantic Parsing. Association for Computational Linguistics, 2333. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  52. [52] Keivanloo I., Rilling J., and Zou Ying. 2014. Spotting working code examples. In 36th International Conference on Software Engineering (ICSE).Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. [53] Kery Mary Beth, Horvath Amber, and Myers B.. 2017. Variolite: Supporting exploratory programming by data scientists. CHI Conference on Human Factors in Computing Systems (CHI).Google ScholarGoogle Scholar
  54. [54] Kery Mary Beth and Myers B.. 2017. Exploring exploratory programming. In IEEE Symposium on Visual Languages and Human-centric Computing (VL/HCC). 2529.Google ScholarGoogle ScholarCross RefCross Ref
  55. [55] Ko A. and Myers B.. 2004. Designing the whyline: A debugging interface for asking questions about program behavior. In CHI Conference on Human Factors in Computing Systems (CHI).Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. [56] Ko A. and Myers B.. 2008. Debugging reinvented. In ACM/IEEE 30th International Conference on Software Engineering (ICSE). 301310.Google ScholarGoogle Scholar
  57. [57] Ko Amy, Myers Brad A., and Aung Htet Htet. 2004. Six learning barriers in end-user programming systems. In IEEE Symposium on Visual Languages and Human-centric Computing (VL/HCC). IEEE, 199206.Google ScholarGoogle Scholar
  58. [58] Kock Ned and Lynn Gary. 2012. Lateral collinearity and misleading results in variance-based SEM: An illustration and recommendations. J. Assoc. Inf. Syst. 13, 7 (2012).Google ScholarGoogle Scholar
  59. [59] Kulal S., Pasupat Panupong, Chandra K., Lee Mina, Padon Oded, Aiken A., and Liang Percy. 2019. SPoC: Search-based pseudocode to code. In 33rd Conference on Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  60. [60] Kushman Nate and Barzilay R.. 2013. Using semantic unification to generate regular expressions from natural language. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL).Google ScholarGoogle Scholar
  61. [61] Landman Davy, Serebrenik Alexander, Bouwers Eric, and Vinju Jurgen J.. 2016. Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions. J. Softw. Evolut. Process 28, 7 (2016), 589618.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. [62] Le Vu and Gulwani Sumit. 2014. FlashExtract: A framework for data extraction by examples. ACM SIGPLAN Not. 49, 6 (2014), 542553.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. [63] Lei Tao, Long F., Barzilay R., and Rinard M.. 2013. From natural language specifications to program input parsers. In 51st Annual Meeting of the Association for Computational Linguistics (ACL).Google ScholarGoogle Scholar
  64. [64] Li Toby Jia-Jun, Azaria Amos, and Myers B.. 2017. SUGILITE: Creating multimodal smartphone automation by demonstration. In CHI Conference on Human Factors in Computing Systems (CHI).Google ScholarGoogle Scholar
  65. [65] Li Toby Jia-Jun, Labutov I., Li X., Zhang X., Shi W., Ding Wanling, Mitchell Tom Michael, and Myers B.. 2018. APPINITE: A multi-modal interface for specifying data descriptions in programming by demonstration using natural language instructions. In IEEE Symposium on Visual Languages and Human-centric Computing (VL/HCC). 105114.Google ScholarGoogle ScholarCross RefCross Ref
  66. [66] Li Toby Jia-Jun, Radensky Marissa, Jia J., Singarajah Kirielle, Mitchell Tom Michael, and Myers B.. 2019. PUMICE: A multi-modal agent that learns concepts and conditionals from natural language and demonstrations. In 32nd Annual ACM Symposium on User Interface Software and Technology (UIST).Google ScholarGoogle Scholar
  67. [67] Lieberman H., Paternò F., Klann Markus, and Wulf V.. 2006. End-user development: An emerging paradigm. In End User Development.Google ScholarGoogle ScholarCross RefCross Ref
  68. [68] Ling Wang, Blunsom Phil, Grefenstette Edward, Hermann Karl Moritz, Kociský Tomás, Wang Fumin, and Senior Andrew W.. 2016. Latent predictor networks for code generation. In 54th Annual Meeting of the Association for Computational Linguistics (ACL). The Association for Computer Linguistics. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  69. [69] Liu C., Xia Xin, Lo David, Gao Cuiyun, Yang Xiaohu, and Grundy J.. 2020. Opportunities and challenges in code search tools. ArXiv abs/2011.02297 (2020).Google ScholarGoogle Scholar
  70. [70] Liu X., Shen Beijun, Zhong H., and Zhu Jiangang. 2016. EXPSOL: Recommending online threads for exception-related bug reports. In 23rd Asia-Pacific Software Engineering Conference (APSEC). 2532.Google ScholarGoogle ScholarCross RefCross Ref
  71. [71] Lu Meili, Sun Xiaobing, Wang S., Lo D., and Duan Yucong. 2015. Query expansion via WordNet for effective code search. In International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 545549.Google ScholarGoogle Scholar
  72. [72] Maloney J., Resnick M., Rusk N., Silverman B., and Eastmond Evelyn. 2010. The scratch programming language and environment. ACM Trans. Comput. Educ. 10 (2010), 16:1–16:15.Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. [73] Manning Christopher D., Schütze Hinrich, and Raghavan Prabhakar. 2008. Introduction to Information Retrieval. Cambridge University Press.Google ScholarGoogle ScholarCross RefCross Ref
  74. [74] Manshadi Mehdi, Gildea Daniel, and Allen James F.. 2013. Integrating programming by example and natural language programming. In AAAI Conference on Artificial Intelligence (AAAI).Google ScholarGoogle ScholarCross RefCross Ref
  75. [75] McCabe T.. 1976. A complexity measure. IEEE Trans. Softw. Eng. SE-2 (1976), 308320.Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. [76] Mihalcea Rada, Liu Hugo, and Lieberman Henry. 2006. NLP (natural language processing) for NLP (natural language programming). In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 319330.Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. [77] Mohagheghi Parastoo and Conradi Reidar. 2007. Quality, productivity and economic benefits of software reuse: A review of industrial studies. Empir. Softw. Eng. 12, 5 (2007), 471516.Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. [78] Moreno Laura, Bavota Gabriele, Penta Massimiliano Di, Oliveto Rocco, and Marcus Andrian. 2015. How can I use this method? In IEEE/ACM 37th IEEE International Conference on Software Engineering. IEEE, 880890.Google ScholarGoogle ScholarCross RefCross Ref
  79. [79] Mundlak Yair. 1978. On the pooling of time series and cross section data. Economet.: J. Economet. Societ. (1978), 6985.Google ScholarGoogle ScholarCross RefCross Ref
  80. [80] Murphy Lauren, Kery Mary Beth, Alliyu Oluwatosin, Macvean Andrew, and Myers Brad A.. 2018. API designers in the field: Design practices and challenges for creating usable APIs. In IEEE Symposium on Visual Languages and Human-centric Computing (VL/HCC). IEEE, 249258.Google ScholarGoogle Scholar
  81. [81] Myers B., Pane J., and Ko A.. 2004. Natural programming languages and environments. Commun. ACM 47 (2004), 4752.Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. [82] Myers B. and Stylos Jeffrey. 2016. Improving API usability. Commun. ACM 59 (2016), 6269.Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. [83] Myers Brad A., Ko Amy, LaToza Thomas D., and Yoon YoungSeok. 2016. Programmers are users too: Human-centered methods for improving programming tools. Computer 49, 7 (2016), 4452.Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. [84] Myers Brad A. and Stylos Jeffrey. 2016. Improving API usability. Commun. ACM 59, 6 (2016), 6269.Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. [85] Nakagawa Shinichi and Schielzeth Holger. 2013. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Meth. Ecol. Evolut. 4, 2 (2013), 133142.Google ScholarGoogle ScholarCross RefCross Ref
  86. [86] Nam Daye, Horvath Amber, Macvean Andrew, Myers Brad, and Vasilescu Bogdan. 2019. Marble: Mining for boilerplate code to identify API usability problems. In International Conference on Automated Software Engineering (ASE). IEEE, 615627.Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. [87] Nguyen T. and Csallner C.. 2015. Reverse engineering mobile application user interfaces with REMAUI (T). In 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). 248259.Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. [88] Nowell Lorelli S., Norris Jill M., White Deborah E., and Moules Nancy J.. 2017. Thematic analysis: Striving to meet the trustworthiness criteria. Int. J. Qualit. Meth. 16, 1 (2017), 1609406917733847.Google ScholarGoogle ScholarCross RefCross Ref
  89. [89] Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. 2002. Bleu: A method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 311318. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. [90] Parisotto Emilio, Mohamed Abdel Rahman, Singh R., Li L., Zhou Dengyong, and Kohli Pushmeet. 2017. Neuro-symbolic program synthesis. In 5th International Conference on Learning Representations (ICLR).Google ScholarGoogle Scholar
  91. [91] Ponzanelli Luca, Bacchelli Alberto, and Lanza Michele. 2013. Seahawk: Stack overflow in the IDE. In International Conference on Software Engineering (ICSE). IEEE, 12951298.Google ScholarGoogle ScholarCross RefCross Ref
  92. [92] Ponzanelli Luca, Bavota G., Penta M. D., Oliveto R., and Lanza M.. 2014. Mining Stack Overflow to turn the IDE into a self-confident programming prompter. In International Conference on Mining Software Repositories (MSR).Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. [93] Price David, Rilofff Ellen, Zachary Joseph, and Harvey Brandon. 2000. NaturalJava: A natural language interface for programming in Java. In International Conference on Intelligent User Interfaces (IUI). 207211.Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. [94] Proksch Sebastian, Amann Sven, and Nadi Sarah. 2018. Enriched event streams: A general dataset for empirical studies on in-IDE activities of software developers. In 15th International Conference on Mining Software Repositories (MSR). 6265.Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. [95] Radhakrishnan Karthik, Srikantan Arvind, and Lin Xi Victoria. 2020. ColloQL: Robust Text-to-SQL over search queries. In 1st Workshop on Interactive and Executable Semantic Parsing. 3445.Google ScholarGoogle Scholar
  96. [96] Raghothaman Mukund, Wei Y., and Hamadi Y.. 2016. SWIM: Synthesizing what I mean—Code search and idiomatic snippet synthesis. In IEEE/ACM 38th International Conference on Software Engineering (ICSE). 357367.Google ScholarGoogle Scholar
  97. [97] Rahman M. M. and Roy C.. 2014. SurfClipse: Context-aware meta-search in the IDE. In IEEE International Conference on Software Maintenance and Evolution. 617620.Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. [98] Rahman Mohammad Masudur, Yeasmin Shamima, and Roy Chanchal K.. 2014. Towards a context-aware IDE-based meta search engine for recommendation about programming errors and exceptions. In International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 194203.Google ScholarGoogle ScholarCross RefCross Ref
  99. [99] Raychev Veselin, Vechev Martin, and Yahav Eran. 2014. Code completion with statistical language models. In ACM Conference on Programming Language Design and Implementation (PLDI). ACM, 419428.Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. [100] Raza Mohammad, Gulwani Sumit, and Milic-Frayling Natasa. 2015. Compositional program synthesis from natural language and examples. In 24th International Joint Conference on Artificial Intelligence (IJCAI).Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. [101] Rice Henry Gordon. 1953. Classes of recursively enumerable sets and their decision problems. Trans. Amer. Math. Soc. 74, 2 (1953), 358366.Google ScholarGoogle ScholarCross RefCross Ref


        • Published in

          ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 2
          April 2022, 789 pages
          ISSN: 1049-331X
          EISSN: 1557-7392
          DOI: 10.1145/3492439
          • Editor: Mauro Pezzè

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 March 2022
          • Accepted: 1 September 2021
          • Revised: 1 July 2021
          • Received: 1 January 2021
          Published in TOSEM, Volume 31, Issue 2


          Qualifiers

          • research-article
          • Refereed
