Program Synthesis from Natural Language Using Recurrent Neural Networks

Xi Victoria Lin, UW CSE, Seattle, WA, USA, [email protected]

Chenglong Wang, UW CSE, Seattle, WA, USA, [email protected]

Deric Pang, UW CSE, Seattle, WA, USA, [email protected]

Kevin Vu, UW CSE, Seattle, WA, USA, [email protected]

Luke Zettlemoyer, UW CSE, Seattle, WA, USA, [email protected]

Michael D. Ernst, UW CSE, Seattle, WA, USA, [email protected]

ABSTRACT

Oftentimes, a programmer may have difficulty implementing a desired operation. Even when the programmer can describe her goal in English, it can be difficult to translate into code. Existing resources, such as question-and-answer websites, tabulate specific operations that someone has wanted to perform in the past, but they are not effective in generalizing to new tasks, to compound tasks that require combining previous questions, or sometimes even to variations of listed tasks.

Our goal is to make programming easier and more productive by letting programmers use their own words and concepts to express the intended operation, rather than forcing them to accommodate the machine by memorizing its grammar. We have built a system that lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language for review and approval by the programmer. Our system, Tellina, does the translation using recurrent neural networks (RNNs), a state-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements.

We evaluated Tellina in the context of shell scripting. We trained Tellina's RNNs on textual descriptions of file system operations and bash one-liners, scraped from the web. Although recovering completely correct commands is challenging, Tellina achieves top-3 accuracy of 80% for producing the correct command structure. In a controlled study, programmers who had access to Tellina outperformed those who did not, even when Tellina's predictions were not completely correct, to a statistically significant degree.

1 INTRODUCTION

Even if a competent programmer knows what she wants to do and can describe it in English, it can still be difficult to write code to achieve her goal. Programmers increasingly work across libraries and programming languages, creating more complex systems than ever before, and cannot memorize every detail of all the systems that must be used.

An increasingly common practice is to seek help from websites such as Stack Overflow. Tutorial and question-answering websites are powerful resources with reams of specific examples of code snippets and explanations of their behavior. When a programmer finds her exact question has been asked before, the community-vetted answer is invariably useful. However, finding the correct

Question 1. I have a bunch of ".zip" files in several directories "dir1/dir2", "dir3", "dir4/dir5". How would I move them all to a common base folder? (http://unix.stackexchange.com/questions/67503)

Solution: find dir*/ -type f -name "*.zip" -exec mv {} "basedir" \;

Question 2. I have one folder for log with 7 sub-folders. I want to delete all the files older than 15 days in all folders including sub-folders without touching folder structure. (http://unix.stackexchange.com/questions/155184)

Solution: find . -type f -mtime +15 | xargs rm -f

Figure 1: Linux command-line questions posted on the Unix Stack Exchange forum. The answer to each is a bash one-liner: a command that can be typed at the bash command line.

answer may require the use of keywords the programmer does not know. Furthermore, sometimes outdated or incorrect answers persist even if the correct answer also appears. Finally, a programmer who wishes to do something new may waste time searching for it, and then must synthesize it on her own. Even a variation of a task on the website may be difficult to find because of different constants and keywords.

Despite their limitations, tutorial and question-answering websites are valuable sources of information that tools should exploit in order to help developers solve programming tasks. We used these websites to gather training data for a novel natural language (NL) to code translation tool that allows the user to express their intent in English and automatically translates it into executable programs. Such synthesis methods have many advantages. The learned models can often generalize to new NL descriptions or synthesize novel code, and the programmer does not need to search through large websites to complete her task. However, they are also inherently error-prone. For the near future, all NL-driven synthesis methods will make mistakes and care must be taken to present their output to users in a way they can easily use, modify, or take inspiration from, without requiring that they blindly accept a single system output.

This paper presents a complete machine learning approach for natural language (NL) to code translation, along with a detailed user study that demonstrates the effectiveness of the overall approach

[Figure 2 graphic; the recoverable content follows. Natural language input X: "find all log files older than 15 days". Step 1 (open-vocabulary entity recognition) yields entity mentions {filename: "log", timespan: "15 days"} and the natural language template X̃: "find all [filename] files older than [timespan]". Step 2 (NL template to program template translation) yields synthesized program templates, e.g. Ỹ1: find [path] -name [regex] -mtime [+timespan], Ỹ2: find [path] -type f -name [regex] -mtime [+timespan], Ỹ3: find [path] -name [regex] -mtime [+timespan] | xargs ls, among others such as find [path] -type f -perm [permission]. Step 3 (argument filling, via nearest-neighbor classification of entity-slot pairs such as 〈"log", [regex]〉 and 〈"15 days", [+timespan]〉) yields complete programs Y1: find . -name "*.log" -mtime +15, Y2: find . -type f -name "*.log" -mtime +15, Y3: find . -name "*.log" -mtime +15 | xargs ls.]

Figure 2: Tellina's three-step architecture for program synthesis from natural language. Step 2 shows the RNN encoder-decoder model used for translating natural language templates to program templates. The blue rectangles and the yellow rectangles represent the RNN cells for the encoder and decoder respectively. Step 3 shows the nearest neighbor classification method used for evaluating the compatibility of an entity mention and a program slot. Each data point is an 〈entity, slot〉 pair represented as the concatenation of the hidden state vectors of their corresponding RNN cells. The compatible entity-slot pairs, 〈"log", [regex]〉 and 〈"15 days", [+timespan]〉, are closer to the positive training examples ("+"s), while the non-compatible pairs are closer to the negative training examples ("-"s).

even when the predicted code is not completely correct. Our system, Tellina, is instantiated for the problem of bash scripting, where a user must perform complex file system operations. Figure 1 shows two example tasks that users posed in this domain. Such problems are worthy of study because programmers must perform them often but the individual commands are complex enough to make it difficult to remember the exact details. For example, the Linux find utility, which searches the file system, supports more than 70 flags [1] to control the search criteria and perform operations on the returned files; it is also commonly combined with other commands like mv or grep for more complex tasks. Man pages describe these commands and flags, but can be hard to discover and understand. Community sites exist [2] and cover a wide variety of commands, but can be hard to search and will not cover all users' intents. None of these problems are specific to the command line or shell scripting; similar problems arise in other libraries and languages.

Our learning approach incorporates recent advances in neural machine translation [2, 32, 35] within a novel learning architecture that reduces data sparsity by abstracting over constants in the input specifications (e.g., names of files, dates, etc.). A key challenge is to learn to align the correct input strings with parameter values for the output commands, which we show is possible with a k-nearest neighbor classifier and greedy matching algorithms. We trained the model on a newly gathered corpus of over 5,000 natural language and bash command pairs. Our experiments show that the proposed model is competitive for this task: evaluated on unseen natural language descriptions, it achieves 80% top-3 accuracy for determining command structures and 36% top-3 accuracy for full commands.

[1] Flags are also called command-line options. Some flags also take arguments.
[2] Including Unix Stack Exchange (http://unix.stackexchange.com/), CommandLineFu (http://www.commandlinefu.com/), and Bash One-Liners (http://www.bashoneliners.com/).

We also present a user study that measures the performance of programmers using this learned model instantiated in an assistant tool, Tellina. Compared to the current state of the art (man pages and online resources such as question-answering forums and web search tools), programmers benefited from using Tellina to a statistically significant degree. For example, despite the fact that Tellina does not always produce fully correct command suggestions, it shortened the working time by 21.7% for programmers on bash file system tasks.

In summary, this paper makes the following contributions:

• We propose a novel deep-learning approach for synthesizing programs from natural language. Our approach combines state-of-the-art recurrent neural networks with a learning approach for inserting constants into the generated programs.

• We instantiated the approach for a challenging domain: bash commands containing 17 file system utilities, more than 200 flags, 9 types of open-vocabulary constants, and nested command structures such as pipelines, command substitution, and process substitution.

• In order to provide data for training and evaluation, we collected over 5,000 〈NL, command〉 pairs.

• We evaluated the accuracy of our approach. Our model achieves top-3 accuracy of 80.0% for determining program structure (that is, ignoring constant values). Our model achieves top-3 accuracy of 36.0% for full commands.

• We conducted a controlled user study to determine whether a good, but not perfectly accurate, model aids end users' programming efficiency. Compared to existing programming resources such as man pages, Stack Overflow, and Google, Tellina shortened the working time by 21.7% for end-user programmers on bash file system tasks. This improvement was statistically significant (p-value < 0.01).


• Our source code and dataset will be released upon publication to support reproducibility in software engineering research.

2 PROBLEM DEFINITION AND FORMAL OVERVIEW

Problem Definition. Following Desai et al. [5], we define programming by natural language (PBNL) as synthesizing a ranked list of programs [Y1, Y2, ..., Yk] based on the specification expressed as a single natural language sentence (denoted as X). A natural language sentence is defined as a sequence of words X = (x_1, ..., x_m) and a program is defined as a sequence of tokens Y = (y_1, ..., y_n). This definition differs from prior work [5], which relies on a context-free grammar definition of the programming language.

Our Approach. We present a machine learning approach for PBNL, which trains the synthesizer using pairs of natural language and programs (denoted as 〈X, Y〉). Figure 1 shows two such training examples. We use a novel three-step deep-learning approach, as illustrated in Figure 2:

(1) A user provides a natural language sentence X, which Tellina transforms to a template X̃ by recognizing and abstracting over domain-specific constants, such as file names or times (§4.1).

(2) A recurrent neural network (RNN) encoder-decoder model translates X̃ into a ranked list of possible program templates Ỹi, where each Ỹi contains argument slots to be filled by entities recognized in step (1) (§3).

(3) The slots in each Ỹi are replaced by program literals to produce an output program Yi, using a k-nearest neighbor classifier which identifies the corresponding source entities (§4.2).

We train this model on a new corpus of English paired with bash commands that we collected (§5.1), which focuses on a complex subset of the language (see fig. 3). The complete system, Tellina, permits users to input natural language queries; it outputs a list of proposed bash commands that may solve the user's task.

Our user studies demonstrate that Tellina significantly outperforms existing alternatives, even when there are mistakes in the set of proposed commands (§7).
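To make the data flow concrete, here is a minimal Python skeleton of the three steps; every name and signature below is an illustrative assumption, not Tellina's actual code:

# A sketch of Tellina's three-step pipeline (fig. 2); names and types are
# illustrative assumptions, and the function bodies are left as stubs.
from typing import Dict, List, Tuple

def recognize_entities(x: str) -> Tuple[str, Dict[str, str]]:
    """Step 1: abstract constants, e.g. 'find all log files older than 15 days'
    -> ('find all [filename] files older than [timespan]',
        {'[filename]': 'log', '[timespan]': '15 days'})."""
    ...

def translate(x_template: str, k: int) -> List[str]:
    """Step 2: the RNN encoder-decoder returns k ranked program templates,
    e.g. ['find [path] -name [regex] -mtime [+timespan]', ...]."""
    ...

def fill_arguments(y_template: str, mentions: Dict[str, str]) -> str:
    """Step 3: align entity mentions to slots (stable matching + k-NN) and
    produce a complete command, e.g. 'find . -name "*.log" -mtime +15'."""
    ...

def synthesize(x: str, k: int = 20) -> List[str]:
    """End-to-end: NL sentence in, ranked list of candidate commands out."""
    x_template, mentions = recognize_entities(x)
    return [fill_arguments(y, mentions) for y in translate(x_template, k)]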

3 RNN ENCODER-DECODER MODEL FOR TRANSLATION

Inspired by recent progress in neural machine translation [2, 32, 35] and its successful application in the domain of formal languages [6, 19, 33], we adopt an RNN encoder-decoder (sequence-to-sequence) model to translate a natural language template into a list of command templates. This section first explains RNNs (§3.1) and encoder-decoders (§3.2). Next, it describes the attention mechanism, a common practice which significantly improves the performance of encoder-decoders (§3.3). Finally, it presents the implementation and training details of the Tellina model (§3.4).

3.1 Recurrent Neural Network (RNN)

A recurrent neural network [11] encodes an input sequence of vectors into a single vector or expands an input vector into an output sequence of vectors. In the most general case, it serves both purposes at the same time. In our context, the sequences of vectors

In-scope syntax structures:
• Single command
• Logical connectives: &&, ||, parentheses ()
• Nested commands: pipeline |, command substitution $(), process substitution <()

Out-of-scope syntax structures:
• I/O redirection <, <<
• Variable assignment =
• Parameters $1
• Compound statements: if, for, while, until, blocks, function definition
• Regex structure (every string is a single opaque token)
• Non-bash program strings triggered by command interpreters such as awk, sed, python, java

Figure 3: The subset of bash commands used as Tellina's domain.

represent sequences of words or tokens (English sentences or bashcommands).

An RNN consists of a set of cells, each consisting of three layers: input, hidden, and output (fig. 4). Each token/vector in a sequence is processed by a different cell in turn. The input layer maps a token in the input sequence to an input state vector:

x_t = I(x_t).    (1)

The hidden layer takes the input state and the previous hidden state as input, and generates the current hidden state as output:

h_t = f(h_{t-1}, x_t).    (2)

The output layer takes the current hidden state as input, and generates an output state vector:

o_t = g(h_t).    (3)

Discrete output symbols can be decoded from the output state. The standard practice is to project o_t into a |V|-dimensional space, where |V| is the size of the output vocabulary, and apply the softmax function:

y_t = softmax(W_o o_t).    (4)

The softmax function outputs a vector of non-negative entries that sum to one. Therefore y_t can be interpreted as a probability distribution over the output vocabulary conditioned on the partial input history x_1, ..., x_t:

y_t ∼ p(y_t | x_1, ..., x_t).    (5)

In general, I is a look-up table and f, g are non-linear functions. The Tellina model defines f and g to be gated recurrent units (GRUs) [4].
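As a concrete, non-authoritative illustration of eqs. (1)-(5), the following NumPy sketch runs one GRU cell step per token; the dimensions, random initialization, toy inputs, and the choice of g as the identity are all assumptions for illustration, not Tellina's implementation:

# A minimal NumPy sketch of one RNN/GRU step implementing eqs. (1)-(4).
import numpy as np

V, d = 1000, 64                        # assumed vocabulary size, hidden dim
E  = np.random.randn(V, d) * 0.01      # I: embedding look-up table, eq. (1)
Wz = np.random.randn(2 * d, d) * 0.01  # GRU update-gate weights
Wr = np.random.randn(2 * d, d) * 0.01  # GRU reset-gate weights
Wh = np.random.randn(2 * d, d) * 0.01  # GRU candidate-state weights
Wo = np.random.randn(d, V) * 0.01      # output projection W_o, eq. (4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, token_id):
    """One cell: h_t = f(h_{t-1}, x_t) with f a GRU, then decode y_t."""
    x = E[token_id]                                    # eq. (1)
    z = sigmoid(np.concatenate([h_prev, x]) @ Wz)      # update gate
    r = sigmoid(np.concatenate([h_prev, x]) @ Wr)      # reset gate
    h_cand = np.tanh(np.concatenate([r * h_prev, x]) @ Wh)
    h = (1 - z) * h_prev + z * h_cand                  # eq. (2)
    o = h                                              # eq. (3): g taken as identity here
    logits = o @ Wo
    y = np.exp(logits - logits.max())
    return h, y / y.sum()                              # eq. (4): softmax over |V|

h = np.zeros(d)
for tok in [3, 17, 42]:                                # a toy token sequence
    h, y_dist = gru_step(h, tok)                       # y_dist ~ p(y_t | x_1..x_t), eq. (5)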

3.2 Encoder-decoder models

As described in §3.1, an RNN can be used to encode an input sequence or to generate an output sequence. For most translation problems, the input sequence and output sequence are of different lengths; hence, it is difficult to use a single RNN to model both. Such problems can be solved using the combination of an

[Figure 4 graphic; the recoverable content follows. The encoder cells (input states x1..x7, hidden states h1..h7, output states y1..y7) read the natural language template "find [filename] files in the current folder"; the decoder cells (x'1..x'5, h'1..h'5, y'1..y'5) start from <START> and emit "find [path] -name [regex] <EOS>".]

Figure 4: Configuration of the Seq2Seq neural network translation model. Each layer (input, hidden, and output from bottom to top) is labeled with the state vector it produces. The encoder reads the natural language description and passes its final hidden state to the decoder. The decoder consumes the encoder's final hidden state and generates the program starting from the special symbol <START>. For the decoder, each input symbol is the output symbol from the previous step (denoted by the dotted lines connecting the input layer at each step with the previous output layer). The green dotted lines mark the word alignments learned via the attention mechanism. While the attention mechanism computes an alignment score for each pair of encoder hidden state and decoder hidden state, the figure illustrates only the alignments with high scores.

encoder RNN and a decoder RNN, which is commonly referred to as encoder-decoder modeling (Seq2Seq, fig. 4). [3]

As shown in fig. 4, the encoder RNN and decoder RNN are connected in the hidden layer. The source sequence is fed into the encoder RNN. The output sequence is decoded from the decoder RNN, and the generation is biased by the final hidden state of the encoder RNN. The decoder's input symbol at step t is its output symbol at step t−1. Using eq. (4):

y'_t ∼ p(y'_t | y'_1, ..., y'_{t-1}, h_final).    (6)

Applying the chain rule to eq. (6), the decoder RNN defines a conditional probability distribution of the target sequences given the (encoded) source sequence:

p(y'_1, ..., y'_{T'} | h_final) = ∏_{t=1}^{T'} p(y'_t | y'_1, ..., y'_{t-1}, h_final).    (7)

The translation problem is hence reduced to decoding the target sequence with the maximum conditional likelihood:

Y' = argmax_{y'_1, ..., y'_{T'}} p(y'_1, ..., y'_{T'} | h_final).    (8)

The optimization in eq. (8) cannot be computed efficiently since p(y'_1, ..., y'_{T'} | h_final) does not factorize step-wise. Effective approximations include beam search [32] (which Tellina uses) or greedily picking the token with the maximum local score at each step.
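A minimal sketch of beam-search decoding as an approximation to eq. (8) follows; the step function is an assumed stand-in for the decoder (returning a log-probability per vocabulary token given a prefix), not Tellina's actual interface:

# Beam search over decoder prefixes; greedy decoding is the beam_size=1 case.
def beam_search(step, start_token, end_token, beam_size=100, max_len=50):
    """Keep the beam_size highest-scoring prefixes at each decoding step."""
    beams = [([start_token], 0.0)]          # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step(prefix).items():
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Completed hypotheses leave the beam; the rest keep growing.
            (finished if prefix[-1] == end_token else beams).append((prefix, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)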

3.3 Attention

Recent work in neural machine translation has shown that the model can benefit from making the prediction of a target output token depend directly on the encodings of the input tokens [2, 33]. For example, in fig. 4 the choice of the output token "[regex]" might be triggered by the tokens "[filename]" and "files" in the source sequence.

An attention mechanism [33] can be used to find and encode the relevant inputs. Specifically, it computes an alignment model α between the encoder and decoder hidden states, and a context vector c_t which is a sum of all encoder hidden states weighted by α:

α_{ti} = a(h'_t, h_i)    (9)

c_t = ∑_{i=1}^{T} α_{ti} h_i.    (10)

[3] Some recent work used tree-structured RNNs in the encoder-decoder framework [6, 27]. We tried both. A Seq2Tree network did not yield significant performance improvement compared to Seq2Seq, but was dependent on a specific grammar definition. Future work can investigate more sophisticated encoder-decoder architectures.

The context vector c_t is then used as the input to the decoder output layer together with the current hidden state h'_t:

o'_t = g(h'_t, c_t).    (11)

In general, α_{ti} is defined to be the probability that the output at step t of the decoder is translated from the input at step i of the encoder. Following Dong and Lapata [6], we define α_{ti} as the inner product between the decoder hidden state and the encoder hidden state, normalized over all encoder steps:

α_{ti} = exp(h'_t · h_i) / ∑_{j=1}^{T} exp(h'_t · h_j).    (12)
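A minimal NumPy sketch of eqs. (9)-(12) with the dot-product alignment of eq. (12); the array shapes and toy data are illustrative assumptions:

import numpy as np

def attention_context(dec_h, enc_hs):
    """Alignment weights alpha_{ti} (eq. 12) and context vector c_t (eq. 10)."""
    scores = enc_hs @ dec_h            # inner products h'_t . h_i, eq. (12) numerator
    scores = scores - scores.max()     # numerical stability before exponentiating
    alpha = np.exp(scores) / np.exp(scores).sum()   # normalize over encoder steps
    c_t = alpha @ enc_hs               # weighted sum of encoder states, eq. (10)
    return alpha, c_t                  # c_t feeds the output layer, eq. (11)

enc_hs = np.random.randn(6, 4)         # T = 6 encoder hidden states of dimension 4
dec_h = np.random.randn(4)             # one decoder hidden state h'_t
alpha, c_t = attention_context(dec_h, enc_hs)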

3.4 Training and hyperparameter setting

The Tellina model uses a bi-directional RNN [31] encoder, which consists of a forward RNN that reads the input sequence in the usual order and a backward RNN that reads the input sequence in the reversed order. The hidden states of the two RNNs are concatenated to generate the output state. The rest of the construction is the same as the base model presented above.

We trained the encoder-decoder using pairs of natural language and program templates. We use the standard sequence-to-sequence training objective [32], which maximizes the likelihood of the ground-truth program template given the natural language template. We trained the neural network (consisting of all token embeddings and layer parameters) end-to-end using mini-batch gradient descent with Adam [14].

We set up the decoder RNN to be 400-dimensional, and the twoRNNs in the bi-directional encoder to be 200-dimensional. Weset the mini-batch size to 16, the initial learning rate to 0.0001,


• Pattern
  – File: file name
  – Directory: directory name
  – Path: absolute path
  – Permission: Linux file permission code
  – Date/Time: date and time expression
  – Regex: other pattern arguments
• Quantity
  – Number: number
  – Size: file size
  – Timespan: time duration

Figure 5: Two-level type hierarchy of the open-vocabulary entities defined on the file system operation domain.

and the momentum hyperparameters of Adam to their default values [14]. We set the beam size to 100 for beam-search decoding. The hyperparameters were set based on the model's performance on a development dataset (§5.3). We will release our trained model and source code to support reproducible research.

4 ARGUMENT FILLING

The neural encoder-decoder model (§3) takes a templated NL sentence as input and produces a ranked list of program templates, which contain empty argument slots to be filled. In this section, we describe how the NL templates are constructed and how the program slot values are filled, to complete the full processing pipeline.

4.1 Template Generation

We use a domain-specific heuristic approach for finding and abstracting over the textual entities in the input sentences and the corresponding argument values in the programs they are paired with.

In the domain of bash commands, we identified two categories of open-vocabulary entities, patterns and quantities, and several second-level semantic types (fig. 5). To recognize and assign types to the entity mentions in the natural language sentence, we manually defined a regular expression for each semantic type to match the entities of that type. [4] The recognizers for quantities perform well in general, since numerical expressions are strong predictors for such entities. However, these recognizers perform poorly for a few semantic types. For example, instead of writing a date expression in standard format like "08/02/2017", the user may describe it verbally, e.g., "the 2nd of August of 2017", and our heuristic patterns cannot cover all possible cases. To generate a natural language template used as input to the sequence-to-sequence RNN from a sentence, we replace each open-vocabulary entity in the sentence with its semantic type.
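A sketch of this recognition step follows; the two regular expressions shown are simplified assumptions standing in for the per-type recognizers (the real set covers all nine semantic types of fig. 5):

# Regex-based open-vocabulary entity recognition and template generation.
import re

TYPE_PATTERNS = [
    ("[timespan]", re.compile(r"\b\d+\s*(?:day|week|month|year)s?\b")),
    ("[filename]", re.compile(r"\b[\w.*]+\b(?=\s+files?\b)")),  # e.g. "log" in "log files"
]

def templatize(sentence):
    """Replace entity mentions with their semantic types; return the NL
    template and the recognized mentions."""
    mentions = {}
    template = sentence
    for slot_type, pattern in TYPE_PATTERNS:
        match = pattern.search(template)
        if match:
            mentions[slot_type] = match.group(0)
            template = template[:match.start()] + slot_type + template[match.end():]
    return template, mentions

print(templatize("find all log files older than 15 days"))
# -> ('find all [filename] files older than [timespan]',
#     {'[timespan]': '15 days', '[filename]': 'log'})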

We also generate command templates from commands. First, we type each argument in the command according to its type definition in man pages. Second, we map the man-page types to one of the 9 semantic types defined in fig. 5. For example, in Linux man pages, the target directory argument of find is of type path, and the argument of the option -name is of type pattern; our rules map path to Path and pattern to Regex. Finally, we replace each command argument with its semantic type.

[4] The regex-based recognizers we defined perform recognition solely based on the surface form of the entity mentions and do not leverage the context.

4.2 Program Slot Filling

Entity mentions in an NL sentence often form a one-to-one mapping with a subset of the arguments in its corresponding program. Hence, Tellina performs argument filling in two steps. It first aligns the list of entity mentions with the program slots using the stable matching algorithm (§4.2.1, §4.2.2). Then, it extracts the argument values from the aligned entity mentions and fills the argument slots to form the complete program (§4.2.3).

Aligning entities with argument slots is challenging because of the following two types of ambiguity. First, there are usually multiple entity mentions and multiple argument slots. While type-matching provides a strong signal for the alignment, there may exist multiple entities of the same type. Such ambiguities need to be resolved based on the contextual information of the entity mentions. For example, in the task of "moving some files from the source directory to the target directory", the algorithm needs to correctly identify the source and target directory without swapping their positions. Second, inferring entity and argument types based on heuristics may be imprecise, and wrong types can lead to wrong alignment results. For example, heuristically, we cannot confidently determine whether the substring "2017*" in a user's input refers to a list of files, directories, or some other entity without looking into its context.

To tackle the above challenges, our slot-filling algorithm makes use of the contextual information of both the entities and the slots.

4.2.1 Global entity-slot alignment. Algorithm 1 computes the alignment between the entities and slots using a modified version of the stable matching algorithm [21]. Besides entities and slots, the algorithm also takes as input a local entity-slot match scoring function γ(i, j) whose output represents how likely the entity ei is to be matched with the slot sj based on local information. For example, such a local function may be type based, and "15 days" is more likely to be an argument of -mtime compared to "log files" since their types match.

Upon initialization, the algorithm first computes a preference list of slots for each entity, based on the local matching scores returned by γ(i, j). At each iteration of the while loop, the algorithm selects an unmatched entity mention ei with a non-empty preference list Sei, and attempts to match it with the highest-ranked slot sj in Sei. If sj is not matched, or sj prefers ei to the entity mention ei′ it is currently matched to, ei and sj become matched. Otherwise ei stays unmatched.

The while loop is guaranteed to terminate. Upon termination, it is guaranteed that every entity is aligned to at most one slot [21]. If every entity mention is aligned to exactly one argument slot, it returns the alignment. Otherwise, it returns the null value indicating that not all entities can be aligned to a slot. [5] In both cases, it is possible for some argument slots to be left unmatched.

[5] In practice, we found this is an effective rule for detecting wrong program templates. Therefore Tellina filters out all command templates that do not have enough argument slots for the recognized entities.


Algorithm 1: Global entity-slot alignment

Input:  List of entities E, list of argument slots S, local entity-slot compatibility function γ(i, j).
Output: List of matched entity-slot pairs M if every entity is aligned to a slot; null otherwise.

M = ∅
/* compute the preference list for each entity, ordered by decreasing γ(i, j) */
for ei ∈ E do
    PriorityQueue Sei
    for sj ∈ S do
        if γ(i, j) ≠ -inf then
            Sei.Enqueue(sj)
        end
    end
end
/* compute the stable alignment */
while ∃ ei s.t. ∀ sj (ei, sj) ∉ M ∧ Sei ≠ ∅ do
    sj = Sei.Dequeue()
    if ∃ ei′ s.t. (ei′, sj) ∈ M then
        if γ(i′, j) < γ(i, j) then
            M = M ∪ {(ei, sj)} \ {(ei′, sj)}
        end
    else
        M = M ∪ {(ei, sj)}
    end
end
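For concreteness, a Python rendering of Algorithm 1 follows; it is a sketch, and Tellina's implementation may differ in details such as tie-breaking:

# Entity-proposing stable matching; gamma(e, s) is the local compatibility
# score, -inf for type-incompatible pairs.
import math

def align(entities, slots, gamma):
    # Preference list per entity: compatible slots, best score first.
    prefs = {e: sorted((s for s in slots if gamma(e, s) != -math.inf),
                       key=lambda s: gamma(e, s), reverse=True)
             for e in entities}
    matched = {}              # slot -> entity currently holding it
    free = list(entities)
    while free:
        e = free.pop()
        if not prefs[e]:
            continue          # preference list exhausted: e stays unmatched
        s = prefs[e].pop(0)   # e proposes to its highest-ranked remaining slot
        rival = matched.get(s)
        if rival is None:
            matched[s] = e
        elif gamma(rival, s) < gamma(e, s):   # the slot prefers the new entity
            matched[s] = e
            free.append(rival)
        else:
            free.append(e)    # rejected: e will try its next slot
    pairs = [(e, s) for s, e in matched.items()]
    # Null result when some entity could not be aligned (see footnote [5]).
    return pairs if len(pairs) == len(entities) else None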

For some slots that are left empty, we automatically fill in default values based on heuristics. For example, we use "." as the default value for the target directory of find, since users often omit the target directory in their NL description if it is the current directory. If a slot is not defined with a default value, we leave it empty when presenting the command to users.

4.2.2 K-nearest neighbor classifier for local entity-slot matching. We developed a novel k-nearest neighbor formulation for the local matching function γ(i, j) introduced in alg. 1. It is based on the intuition that, given the embeddings of the entity mentions and argument slots generated by the encoder-decoder (fig. 4), the embedding of a matched entity-slot pair should be close to those of other matched entity-slot pairs in the embedding space and farther away from those of unmatched entity-slot pairs.

Specifically, we sampled a set of positive and negative entity-slot pairs from our training data and use them as the training set of the k-NN classifier. We represent each entity-slot pair (ei, sj) using the concatenation of the hidden state vectors (h_i, h'_j) of the neural encoder-decoder model. For each test (ei, sj) pair, we define γ(i, j) as a voting score weighted by the distances between (h_i, h'_j) and all the positive and negative training examples in the embedding space:

γ(i, j) = ∑_{(c,d) ∈ NN(i,j,k)} d_{(i,j),(c,d)} · v(c, d),    (13)

where NN(i, j, k) is the set of k-nearest neighbors of (ei, sj) in the training set, d_{(i,j),(c,d)} is the distance function, and v(c, d) is the vote of neighbor example (ec, sd), as defined below:

d_{(i,j),(c,d)} = cos((h_i, h'_j), (h_c, h'_d)),    (14)

v(c, d) = 1 if (ec, sd) match, 0 otherwise.    (15)

This approach leverages the contextual information of both the entity mention and the argument slot, since the hidden state representations of ei and sj encode their context information. For example, the hidden states of the two occurrences of "css" in the sentence "find all css files in the folder css" are different, because they are surrounded by different words.
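A minimal NumPy sketch of the voting score in eqs. (13)-(15) follows; the toy training data and the choice of k are illustrative assumptions:

import numpy as np

def knn_score(pair_vec, train_vecs, train_labels, k=5):
    """gamma(i, j): cosine-similarity-weighted vote of the k nearest training pairs.

    pair_vec:     concatenated hidden states (h_i, h'_j) of a test entity-slot pair
    train_vecs:   (N, 2d) array of training pair embeddings
    train_labels: (N,) array, 1 if the training pair matches, 0 otherwise (eq. 15)
    """
    sims = (train_vecs @ pair_vec) / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(pair_vec))  # eq. (14)
    nearest = np.argsort(-sims)[:k]          # indices of the k most similar pairs
    return float(np.sum(sims[nearest] * train_labels[nearest]))         # eq. (13)

train_vecs = np.random.randn(200, 8)         # toy training embeddings
train_labels = np.random.randint(0, 2, size=200)
print(knn_score(np.random.randn(8), train_vecs, train_labels))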

4.2.3 Post-processing. Given the aligned entities and argument slots, extracting the entity values and reformatting them to complete the argument filling is easier, and a heuristics-based approach works well. Some entities such as file paths and string patterns can be directly copied into the slots. We define a small set of heuristic rules for reformatting the entities that cannot be directly copied, such as prepending the wildcard symbol "*." to a file extension and converting "x weeks" to "7x days". These rules often work well in practice, but do introduce some errors. For example, if a user describes a regular expression pattern verbally as "all log files whose name starts with 2016", we do not have a rule which can infer the regular expression "2016*.log". Further improving argument value extraction is an interesting area for future work. [6]
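Two of these reformatting rules, sketched in Python (the rule set shown is illustrative, not the complete implementation):

import re

def reformat_entity(value, slot_type):
    """Rewrite an extracted entity so it fits its argument slot."""
    if slot_type == "[regex]" and re.fullmatch(r"\w+", value):
        return "*." + value                    # file extension "log" -> "*.log"
    if slot_type == "[timespan]":
        m = re.fullmatch(r"(\d+)\s*weeks?", value)
        if m:
            return str(7 * int(m.group(1)))    # "2 weeks" -> "14" (days)
    return value                               # paths, plain patterns: copy directly

print(reformat_entity("log", "[regex]"))          # *.log
print(reformat_entity("2 weeks", "[timespan]"))   # 14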

5 DATA

We collected 8,000 pairs of NL description and bash command from the web (§5.1), from which 5,413 pairs remained after filtering (§5.1). We split this data into train, dev, and test sets, subject to the constraint that no NL description appears in more than one dataset (§5.3). Our dataset is publicly available for use by other researchers.

5.1 Data Collection

We hired freelancers who were familiar with shell scripting through Upwork [7] to collect data. They searched for web pages that contain bash commands and recorded command-text pairs from them. Each text in a pair is a sentence that describes the command, either extracted from the webpage or written by the freelancer based on their background knowledge and the web page context. We restricted the bash commands to be one-liners and the natural language description to be a single sentence. The source web pages include tutorials, tech blogs, question-answering forums, and course materials. The freelancers collected data through a web interface developed by us, which helps them with page searching, pair recording, and duplicate data elimination. On average, each freelancer collected 50 pairs per hour.

Filtering. After obtaining the text-command pairs collected by the freelancers, we first filtered the dataset with the following rules. First, we discarded all commands that cannot be parsed by the bash parser

[6] Locascio et al. [19] use an RNN encoder-decoder to synthesize regular expressions based on a natural language description. Integrating a similar module into Tellina's neural architecture to synthesize the argument content could be an interesting future direction.
[7] http://www.upwork.com/


Bashlex [8]. Second, we discarded all commands that contain out-of-scope bash operators, as shown in Figure 3. Finally, we discarded commands that contain an operator that appeared fewer than 20 times in our dataset (e.g. cut, du, bzip2, etc.), since the model is unlikely to learn them from the data due to sparsity.
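A sketch of the parse-based filtering follows, using the Bashlex parser cited above; the out-of-scope test shown is a simplified substring check, an assumption rather than the paper's exact rule:

import bashlex

OUT_OF_SCOPE = ("<", "<<", "=", "$1", "if ", "for ", "while ", "until ")

def keep(command):
    """Keep a command only if bashlex can parse it and it avoids
    out-of-scope constructs (fig. 3)."""
    try:
        bashlex.parse(command)
    except Exception:        # bashlex raises a parsing error on invalid input
        return False
    return not any(op in command for op in OUT_OF_SCOPE)

print(keep("find . -type f -mtime +15 | xargs rm -f"))   # True
print(keep("for f in *.zip; do mv $f basedir; done"))    # False (compound statement)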

Cleaning. After filtering, we cleaned both the texts and the commands in the dataset. For texts, we used a probabilistic spell checker [9] to correct spelling errors in the English descriptions. We also manually corrected a subset of the spelling errors that bypassed the spell checker in the English and in the bash commands.

For commands, we first removed sudo and shell input prompt characters such as "$" and "#" from the beginning of each command and replaced absolute command pathnames by their base names (e.g., we changed "/bin/find" to find). Then, we used a parser augmented from Bashlex to parse the command into an AST. The Bashlex AST handles nested command structures such as pipelines, command substitution, and process substitution. We augmented it to map each argument to the utility or utility flag it attaches to, using the command syntax defined in the Linux man pages. The augmented syntax is used to generate the bash command templates, as described in §4.1.

5.2 Data Statistics

After filtering and cleaning, our dataset contains 5,413 〈NL, bash〉 pairs. These commands contain 17 different bash utilities and more than 200 flags. In descending order of frequency, the utilities are find, xargs, grep, egrep, fgrep, ls, rm, cp, mv, wc, chmod, chown, chgrp, sort, head, tail, tar.

Similar to other machine translation datasets [26], in our 〈NL, bash〉 dataset, one natural language description may have multiple corresponding correct bash command solutions, and one bash command may be phrased in multiple different NL descriptions. In general, a higher number of such many-to-many correspondences between NL descriptions and bash commands implies greater challenges for learning and evaluation. First, we are unlikely to collect all possible translations for an NL description, and the model could wrongly penalize correct predictions which it does not recognize during training. Second, at test time, the model may predict correct translations that are different from the ground truth, and manual evaluation has to be done to compute the correct evaluation metrics (§6.1.2). We present the statistics of such correspondences in Table 1.

5.3 Data split

We split the filtered data into train, development (dev), and test sets. We first clustered the pairs by their NL templates: a cluster contains all pairs with the identical NL template. Then, we randomly split the clusters into 80% training, 10% dev, and 10% test. This prevents the model from being tested on an NL template that was included in the training set, which allows us to evaluate the model's ability to generalize to new NL inputs. Table 1 shows the statistics of our data split.
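A sketch of this template-clustered split; the field name nl_template is an illustrative assumption:

import random
from collections import defaultdict

def split_by_template(pairs, seed=0):
    """Cluster pairs by NL template, then split clusters 80/10/10 so that
    no template appears in more than one set."""
    clusters = defaultdict(list)
    for pair in pairs:
        clusters[pair["nl_template"]].append(pair)
    keys = sorted(clusters)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    cut1, cut2 = int(0.8 * n), int(0.9 * n)
    flatten = lambda ks: [p for k in ks for p in clusters[k]]
    return (flatten(keys[:cut1]),        # train
            flatten(keys[cut1:cut2]),    # dev
            flatten(keys[cut2:]))        # test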

[8] https://github.com/idank/bashlex
[9] http://norvig.com/spell-correct.html

        # pairs   cmd/nl   % NL w/ >1 cmd   nl/cmd   % cmd w/ >1 NL
Train   4330      1.21     13%              2.13     27%
Dev     559       1.13     9%               1.39     19%
Test    524       1.19     12%              1.40     15%

Table 1: Data statistics. "cmd/nl" is the average number of bash commands per NL description, and the following column is the percentage of NL descriptions with more than one bash command translation. Similarly, "nl/cmd" is the average number of NL descriptions per bash command, and the following column is the percentage of bash commands with more than one NL description.

Following standard machine learning practice, we first trained our model on the training set and used the dev set to tune the hyperparameters (§3.4). Then, we trained our final model on the combination of the training and dev sets, using the hyperparameters that achieved the best translation accuracy on the dev set (§6.1) during the tuning phase.

6 MODEL EVALUATION

We report the end-to-end translation and argument filling accuracy (§6.1) on our new bash dataset, along with a short discussion of qualitative results (§6.2).

6.1 Translation Accuracy

We report the end-to-end translation accuracy, both automatically and manually computed, of the Tellina model and a code retrieval baseline.

6.1.1 Baseline Model. We implemented a code retrieval (CR) baseline using the tf-idf information retrieval (IR) technique [20]. The IR model encodes every NL description in the dataset into a bag-of-words feature vector using the tf-idf statistics. Given a test example, it computes the cosine similarity between the test NL feature vector and the training NL feature vectors, and returns the top-k most similar commands. We improved the IR model by first retrieving the command template using the NL template similarity, and then performing argument filling for the NL template using type-matching heuristics. [10]
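A minimal scikit-learn sketch of this retrieval baseline; the toy training pairs are illustrative assumptions:

# tf-idf retrieval over NL templates; returns the commands paired with the
# most similar training descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_nl = ["find all [filename] files older than [timespan]",
            "delete all [filename] files in the current directory"]
train_cmd = ["find [path] -name [regex] -mtime [+timespan]",
             "find [path] -name [regex] -delete"]

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_nl)   # bag-of-words tf-idf features

def retrieve(nl_template, k=3):
    """Return the k command templates whose NL descriptions are most similar."""
    sims = cosine_similarity(vectorizer.transform([nl_template]), train_vecs)[0]
    ranked = sims.argsort()[::-1][:k]
    return [train_cmd[i] for i in ranked]

print(retrieve("find [filename] files older than [timespan]"))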

6.1.2 Evaluation Methodology. We report two types of accuracy: top-k full-command accuracy (Acc_F^k) and top-k command-template accuracy (Acc_T^k). We define Acc_F^k to be the percentage of test examples [11] for which a correct full command is ranked k or above in the model output, and Acc_T^k to be the percentage of test examples for which a correct command template is ranked k or above in the model output (i.e. ignoring errors in the constants).
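The metrics reduce to a simple computation once a correctness judgment is available; a sketch, with is_correct standing in for the (partly manual) judgment described in this section:

def top_k_accuracy(examples, k, is_correct):
    """Fraction of test examples with a correct prediction ranked k or above.

    examples:   list of (input, ranked_predictions) pairs
    is_correct: function (input, prediction) -> bool; compares full commands
                for Acc_F^k, or templates (constants ignored) for Acc_T^k
    """
    hits = sum(any(is_correct(x, p) for p in preds[:k])
               for x, preds in examples)
    return hits / len(examples)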

As described in §5.2, many test examples have more than one correct command, and our collected data may not cover them all. Therefore, we asked three freelancers from Upwork who are familiar with shell scripting to evaluate the model output manually. The freelancers independently examined the top-3 translations of both the CR baseline and the Tellina model for all test examples, and

[10] We found the retrieval results based on the full NL description to be significantly worse, since the constants are likely to receive high idf weights while not being representative of the command semantics.
[11] We treat the 〈bash, NL〉 pairs with the same NL description as a single test example.


Model           Acc_F^1   Acc_F^3   Acc_T^1   Acc_T^3
CR Baseline     13.0%     20.6%     54.7%     67.9%
Tellina Model   30.0%     36.0%     69.4%     80.0%

Table 2: Translation accuracies of the Tellina model and the code retrieval baseline.

k     Precision   Recall   F1
1     82.9        87.0     84.9
5     84.6        89.0     86.7
10    82.1        86.2     84.1
100   79.8        84.0     81.9
200   77.2        81.2     79.1

Table 3: Development set performance of the argument filling component for differing k nearest neighbor values.

evaluated the correctness of the commands at both the full-command level and the command-template level. For each command translation, we use the majority vote of the three freelancers as the final evaluation.

6.1.3 Results. Table 2 shows the translation accuracies of both the Tellina model and the CR baseline. The Tellina model beats the CR baseline by a large margin on all accuracy metrics. It achieves strong template-based accuracy, up to 80% on the top-3 metric, indicating the effectiveness of the neural encoder-decoder model. On the other hand, both models struggle to generate full commands, often making mistakes with one or more of the command arguments. Nonetheless, as we will see in §7.6, users still found these predictions useful, even if the final output needs some corrections before it can be executed.

We also evaluated the argument filling component individually by using alg. 1 to fill in the arguments for ground truth templates. Table 3 shows the precision, recall, and F1 measurements on the development set with varying k. The model achieves high accuracy in filling arguments into the templates (∼86.7% F1 with the optimal k). However, due to cascading errors from entity detection and template selection, the overall command correctness ratio is much lower than the template correctness ratio, as we will discuss below.

6.2 Error Analysis

While the template correctness and argument filling accuracies are high (Table 2), we observed that complete command accuracy is much lower overall. To better understand this phenomenon, we sampled 50 synthesized commands whose template is correct but whose full command is incorrect.

Among these 50 incorrect commands, 41 are caused by Tellina's failure to recognize the argument from the NL description, 5 are caused by wrong command alignment, 2 are caused by the fact that the NL description does not provide a concrete argument, and the last 2 are caused by predicting incorrect templates (which the freelancers failed to catch). The vast majority of the NL entity recognition failures involve missing idioms in the task description. For example, our algorithm is unable to extract the directory argument '/' from the idioms "root directory" or "full file system", and the same holds for extracting the permission code "100" from the idiom "read permission".

However, while this type of error harms the full-command translation accuracy of the Tellina model, it does not significantly harm the tool's usability: we observed in our user study (§7) that many users can easily formulate arguments that our algorithm failed to recognize, based on their knowledge of the file system; as a result, these errors can be easily fixed given a correct template and alignment.

7 USER STUDY

We conducted a user study to determine whether Tellina helps programmers complete file system tasks using bash.

7.1 Experiment Design

We measured each participant's performance under the following two treatment conditions.

• Control treatment: The participant may use any local resource (such as man pages and experimentation on the command line) and any Internet resource (such as tutorials, question-and-answer websites, and web search). This emulates how a programmer would normally solve a file system task.

• Experimental treatment: The participant may use any of the above resources, and also Tellina.

We adopted a counterbalanced factorial design in which each participant performs two sets of tasks, one with each treatment. This design prevents confounding due to order effects by randomly assigning the participants into four groups, which correspond to all four taskset and treatment combinations.

We recruited 39 students in the computer science major to participate in the experiment (24 graduate students, 15 undergraduates). None were familiar with Tellina. All of them were familiar with bash. We accepted only graduate students who self-reported to be bash users, and we accepted only undergraduates who had completed or were enrolled in our department's Linux tools course.

We excluded data from 4 of the participants, because 3 of themforgot to switch treatment conditions between the tasksets and 1of them did not complete the study.

7.2 Tasks

Each taskset is made up of 8 tasks. Each task consists of an English description of a file system operation, and a file system in a pre-defined initial state. The desired outcome is either a list of files (possibly with additional attributes such as modification time or number of lines) or a change in the file system, such as deleted/added/modified files. The user's goal is to write a bash command that performs the desired operation without causing extra file system changes. All tasks have outcomes that are automatically verifiable.

The participant may attempt a task multiple times. The participant may reset the file system to its original state in order to start over from a fresh slate. Each task has a 10-minute timeout. If the participant gives up on a task, we count the participant as having spent 10 minutes. In addition, each taskset has a time limit of 40 minutes. [12]

[12] Eleven participants hit this time limit in the experiment (often only for the first taskset).


7.2.1 Selection and filtering. The experiment uses real tasks selected from four websites offering programming help: Stack Overflow (http://stackoverflow.com/), Super User (http://superuser.com/), commandlinefu.com (http://www.commandlinefu.com/), and Bash One-Liners (http://www.bashoneliners.com/).

To obtain candidate tasks from the file system domain, we first extracted all questions from these websites tagged with "bash" and "find", and obtained 401 questions. We retained the 146 of them that can be answered using the 17 bash utilities that appear in Tellina's training set. We did not filter tasks by the flags, and a task may require a flag that is not in Tellina's training set. Finally, we randomly sampled 16 out of the 146 tasks. The answers to those 16 tasks use find, xargs, mv, cp, rm, grep, wc, ls, and tar.

7.2.2 Rewriting. We asked researchers who are not involved in this project to paraphrase the descriptions of the selected tasks. This prevents the original task from being trivially found on the web in case the user copies and pastes the task description into a search engine. Eight researchers in total contributed to the rewriting, and each of them rewrote 1 to 3 tasks. This avoids biasing the participants' phrasing of their natural language queries to be similar to that of a specific writer.

7.2.3 File system. We selected a code repository from GitHub [13] and used it as the file system for our tasks. The file hierarchy is 4 levels deep and consists of 29 folders and 62 files. We changed all constant values in the task descriptions to match the contents of this file system.

7.3 Tool Interface

Tellina is a web application which can be accessed through a URL. It has an interface similar to the Google search engine. A user types a natural language sentence describing a task, then the website displays the RNN model's top 20 bash command translations of the sentence.

To help users understand the output commands, a user can hover over a token, such as a program name or flag, and see the man page description of the token. To provide a level playing field and avoid conflating this explanation feature with use of Tellina, we gave the participants a third-party tool, explainshell (http://explainshell.com/), which provides exactly the same man-page explanation functionality in the control treatment.

To help users in all treatments understand the effects of their commands, we provided a visual file system comparison tool (similar to Araxis Merge, KDiff3, or Meld) that indicates which files and directories differ and permits the user to interactively navigate the file system. This enables a user to understand what is wrong with his or her command without interpreting voluminous, possibly confusing diff output.

7.4 Training

Before starting a taskset, each participant completes a training task with the appropriate treatment condition. The Tellina website includes a tutorial to explain the usage of the interface (e.g., the input should be a complete imperative sentence) and the output

[13] https://github.com/icecreammatt/class-website-template. This repository was randomly selected and is not related to the authors of this paper.

Variable                Domain
Independent:
  Subject               {1, ..., 35}
  Treatment             {Control, Tellina}
  Taskset               {TS1, TS2}
  Order                 {1st, 2nd}
Dependent:
  Time spent (sec.)     [0, 2400]
  Success rate          [0, 1]

Table 4: Experimental variables. Order indicates whether the taskset was the user's first or second taskset. Success rate is the fraction of the 8 tasks in the taskset that the user completed successfully.

presentation. We assumed all participants were familiar with websearch and man pages and did not provide training for them.

7.5 Quantitative results

Table 4 lists the independent and dependent variables. We performed a four-way analysis of variance (ANOVA) for each dependent variable. The statistically significant results (p < 0.01) are that all four independent variables predict the time spent, and subject and taskset also predict the success rate.
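A sketch of such a four-way ANOVA using statsmodels; the toy data, column names, and design below are illustrative assumptions, not the study data:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per (participant, taskset) observation; factors as in Table 4.
df = pd.DataFrame({
    "subject":    [1, 1, 2, 2, 3, 3, 4, 4],
    "treatment":  ["Control", "Tellina", "Control", "Tellina",
                   "Tellina", "Control", "Tellina", "Control"],
    "taskset":    ["TS1", "TS2", "TS2", "TS1", "TS1", "TS2", "TS2", "TS1"],
    "task_order": ["1st", "2nd"] * 4,
    "time_spent": [1800, 1400, 2000, 1500, 1700, 1300, 1900, 1450],
})
model = ols("time_spent ~ C(subject) + C(treatment) + C(taskset) + C(task_order)",
            data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F-test per independent variable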

The effects of subject on time are expected, because the participants are at different bash proficiency levels. There was no statistically significant effect of subject on success rate because the generous time limits (10 minutes per task, 40 minutes per taskset) enabled most users to complete most tasks. The overall success rate was 88%, and we were more interested in how to help programmers become more efficient, because we know that programmers can manage to solve tasks if given enough time.

The effects of order are also expected. On average, the participants spent 20% more time on the first taskset they encountered than on the second (1767 seconds vs. 1414 seconds), but were less successful (84% vs. 92% success rate). This learning effect reflects increasing participant familiarity with the example file system, tools such as file system diff, and bash tricks they learned earlier in the experiment.

The effects of taskset indicate that we failed to create two tasksets of equal difficulty. Users spent 24% more time, but were 10% less successful, with taskset TS2.

Our counterbalanced factorial design enables the most interesting effects, those of treatment, to be accurately determined despite the effects of subject, order, and taskset. Participants in the Tellina treatment spent on average 22% less time (1397 seconds vs. 1784 seconds). This indicates that Tellina helps programmers to write bash commands in less time. The effect of treatment on success rate is significant only at the p < 0.1 level (µ_Tellina = 90%, µ_Control = 85%), for the reasons noted above when discussing the effect of subject on success rate.

7.6 Qualitative Results

Each participant filled out a questionnaire about their experience during the study.

The first part of the questionnaire consists of four Likert-scale questions (table 5). On average the participants wanted to use Tellina in the future (5.8/7). Tellina's partially correct suggestions were helpful (5.2/7) and did not slow down the users (3.2/7), but


Question                                                            Response
Do you want to use Tellina in the future?                           5.8 ± 1.2
How often did partially correct suggestions help you?               5.2 ± 1.5
How often were you slowed down by the incorrect suggestions
  made by Tellina?                                                  3.2 ± 1.4
How easy was it for you to correct the incorrect suggestions
  made by Tellina?                                                  4.6 ± 1.2

Table 5: Mean and standard deviation of the participant responses to the Likert-scale questions. All questions have scale 1-7.

What features of Tellina are the most helpful to you?
What mistakes made by Tellina affected you most?
Please list the features that you think we should add to Tellina.
Please give us any additional comments you have about Tellina.

Table 6: Open-ended questions in the post-study questionnaire.

were difficult to correct (4.6/7). These results confirm our hypothesis that programmers are resilient to noise in the Tellina output and can use it as inspiration when formulating the correct command themselves. Anecdotally, while using Tellina ourselves we learned about command-line flags that we had not known about.

The second part of the questionnaire consisted of four open-ended questions asking the participants to comment on specific features of Tellina and suggest future improvements (table 6). We summarize our findings below.

Useful features. Many participants noted the usefulness of partially correct suggestions. Even when the full command was not correct, the web interface gave documentation and prompted the users to look up command options that they otherwise would not have remembered. Participants also liked the fact that Tellina suggests multiple solutions, which allows users to compare them and decide which one to try. Some participants liked Tellina's argument-filling feature. Tellina returned an answer with constants appropriate to the user's query, enabling the user to try out the command immediately without having to adjust the constants manually.

Limitations. Some participants noted that when Tellina suggested a wrong command that was close to a correct suggestion, it was difficult to trouble-shoot exactly what went wrong. Participants were also frustrated by subtle syntactic errors that prevented the commands from being run as is. (Tellina gives no guarantee that its output is a legal bash command.) Some participants were slowed down by incorrect argument formatting in Tellina's output, such as a missing "+" sign in the argument to find's -mtime flag. Sometimes the RNN model suggested unusually complex commands (e.g., long pipelines), and the participants found them distracting even if some of the commands in them were correct.

Suggestions. Many participants requested better explanations of the output commands, so that they can better decide which one to try. For example, Tellina could incorporate a bash→English translator (either hand-coded or a learned RNN model) to explain its bash command output. Some participants suggested flagging which commands are syntactically valid, or providing sample output. Some participants also suggested interactive features that would enable the user to correct the output or give hints such as "the task needs to be solved using the tar utility".

8 RELATED WORK
Programming by natural language. There has been extensive research on synthesizing programs from natural language descriptions. Rule-based approaches have been developed to synthesize SQL queries [18], smartphone scripts [17], Java method specifications [25], and spreadsheet programs [9]. These methods can often produce complex programs, but may require non-trivial manual effort to build and maintain. Recently, machine learning techniques have been developed to build probabilistic models to represent the joint distribution of text and programs [5, 10, 13]. We extend this line of work by introducing an RNN encoder-decoder approach which can be applied to many different synthesis problems with relatively little domain-specific effort.

Our approach is also related to other deep-learning based PBNL approaches. DeepAPI [8] addresses the problem of retrieving API call sequences based on the user's natural language queries, using the RNN encoder-decoder model. CodeMend [30] proposed encoder-decoder models which complete partial programs by jointly modeling the user's natural language input and the contextual programs. In comparison, Tellina directly synthesizes executable programs from NL descriptions and is able to handle open-world arguments using our novel argument-filling algorithm. Neural Programmer [23, 24] is a recently proposed deep learning architecture that allows direct encoding of discrete operators to improve learning efficiency. Since command languages cannot be succinctly represented using a few core operators, it is difficult to apply this approach to our domain. In addition, we present the first controlled user study demonstrating significant effectiveness of such techniques, despite their imperfect accuracy.

Semantic parsing. The problem of mapping natural language to programs or other formal representations has been extensively studied in the natural language processing community [1, 3, 6, 29, 36]. Earlier machine learning research in this area focused on learning formal grammars for mapping language to meaning representations [3, 36]. However, it has recently been shown that deep-learning based approaches work equally well, for example to produce regular expressions [16] and database queries [6, 23]. We expand the domain by studying deep learning for producing command languages, and evaluate the effect of such systems on programmers.

Deep learning in software engineering. Finally, deep-learning based approaches have also been applied to other software engineering problems, such as defect prediction [34], program feature extraction [22, 28], summarizing code using natural language [12], and program induction [7, 15]. In general, these applications leverage big data and design custom neural architectures for each application. In comparison, our focus is on studying the usefulness of RNN encoder-decoder models, which have been shown to work well for a wide variety of program synthesis problems.


9 CONCLUSION
This paper presents an approach for program synthesis from natural language that leverages state-of-the-art neural machine translation techniques, augmented with slot (argument) filling and other techniques. We studied the complex domain of bash file system operations and conducted a controlled user study which shows that our tool, Tellina, significantly improves programmers' efficiency despite being imperfect in its program predictions.

As future work, it would be interesting to extend our neural architecture so that the entire framework for entity recognition, template translation, and argument filling can be learned end-to-end. It should also be possible to extend the approach to cover more programming languages.

REFERENCES
[1] Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping Semantic Parsers from Conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 421–432. http://dl.acm.org/citation.cfm?id=2145432.2145481

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014). http://arxiv.org/abs/1409.0473

[3] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP, Vol. 2. 6.

[4] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014). http://arxiv.org/abs/1412.3555

[5] Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Sailesh R, and Subhajit Roy. 2016. Program Synthesis Using Natural Language. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 345–356. DOI: http://dx.doi.org/10.1145/2884781.2884786

[6] Li Dong and Mirella Lapata. 2016. Language to Logical Form with Neural Attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 33–43. http://www.aclweb.org/anthology/P16-1004

[7] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401 (2014).

[8] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API Learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 631–642. DOI: http://dx.doi.org/10.1145/2950290.2950334

[9] Sumit Gulwani and Mark Marron. 2014. NLyze: interactive programming by natural language for spreadsheet data analysis and manipulation. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. 803–814. DOI: http://dx.doi.org/10.1145/2588555.2612177

[10] Tihomir Gvero and Viktor Kuncak. 2015. Synthesizing Java Expressions from Free-form Queries. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). ACM, New York, NY, USA, 416–432. DOI: http://dx.doi.org/10.1145/2814270.2814295

[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. DOI: http://dx.doi.org/10.1162/neco.1997.9.8.1735

[12] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1. 2073–2083.

[13] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-Based Statistical Translation of Programming Languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward! 2014). ACM, New York, NY, USA, 173–184. DOI: http://dx.doi.org/10.1145/2661136.2661148

[14] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[15] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. 2015. Neural random-access machines. arXiv preprint arXiv:1511.06392 (2015).

[16] Nate Kushman and Regina Barzilay. 2013. Using Semantic Unification to Generate Regular Expressions from Natural Language. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff (Eds.). The Association for Computational Linguistics, 826–836. http://aclweb.org/anthology/N/N13/N13-1103.pdf

[17] Vu Le, Sumit Gulwani, and Zhendong Su. 2013. SmartSynth: Synthesizing smartphone automation scripts from natural language. In Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 193–206.

[18] Fei Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. PVLDB 8, 1 (2014), 73–84. http://www.vldb.org/pvldb/vol8/p73-li.pdf

[19] Nicholas Locascio, Karthik Narasimhan, Eduardo DeLeon, Nate Kushman, and Regina Barzilay. 2016. Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, Jian Su, Xavier Carreras, and Kevin Duh (Eds.). The Association for Computational Linguistics, 1918–1923. http://aclweb.org/anthology/D/D16/D16-1197.pdf

[20] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, and others. 2008. Introduction to information retrieval. Vol. 1. Cambridge University Press, Cambridge.

[21] D. G. McVitie and L. B. Wilson. 1971. The Stable Marriage Problem. Commun. ACM 14, 7 (July 1971), 486–490. DOI: http://dx.doi.org/10.1145/362619.362631

[22] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, Dale Schuurmans and Michael P. Wellman (Eds.). AAAI Press, 1287–1293. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11775

[23] Arvind Neelakantan, Quoc V. Le, Martín Abadi, Andrew McCallum, and Dario Amodei. 2016. Learning a Natural Language Interface with Neural Programmer. CoRR abs/1611.08945 (2016). http://arxiv.org/abs/1611.08945

[24] Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2015. Neural Programmer: Inducing Latent Programs with Gradient Descent. CoRR abs/1511.04834 (2015). http://arxiv.org/abs/1511.04834

[25] Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, and Amit Paradkar. 2012. Inferring method specifications from natural language API descriptions. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 815–825.

[26] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02). Association for Computational Linguistics, Stroudsburg, PA, USA, 311–318. DOI: http://dx.doi.org/10.3115/1073083.1073135

[27] Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. 2016. Neuro-Symbolic Program Synthesis. CoRR abs/1611.01855 (2016). http://arxiv.org/abs/1611.01855

[28] Hao Peng, Lili Mou, Ge Li, Yuxuan Liu, Lu Zhang, and Zhi Jin. 2015. Building program vector representations for deep learning. In International Conference on Knowledge Science, Engineering and Management. Springer, 547–553.

[29] Chris Quirk, Raymond J. Mooney, and Michel Galley. 2015. Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 878–888. http://aclweb.org/anthology/P/P15/P15-1085.pdf

[30] Xin Rong, Shiyan Yan, Stephen Oney, Mira Dontcheva, and Eytan Adar. 2016. CodeMend: Assisting Interactive Programming with Bimodal Embedding. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16). ACM, New York, NY, USA, 247–258. DOI: http://dx.doi.org/10.1145/2984511.2984544

[31] M. Schuster and K. K. Paliwal. 1997. Bidirectional Recurrent Neural Networks. Trans. Sig. Proc. 45, 11 (Nov. 1997), 2673–2681. DOI: http://dx.doi.org/10.1109/78.650093

[32] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.

[33] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a Foreign Language. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2773–2781. http://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf

[34] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, 297–308.

[35] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs/1609.08144 (2016). http://arxiv.org/abs/1609.08144

[36] Luke S. Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the 21st Conference on Uncertainty in AI. 658–666.
