Key takeaways from ICML 2019

Insertion Transformer: Flexible Sequence Generation via Insertion Operations

The 2019 International Conference on Machine Learning was held at the Long Beach Convention Center in June. The team and I attended several talks and short presentations from which we pulled some interesting highlights to share with you.

The presentation we’ll discuss in this article is Insertion Transformer: Flexible Sequence Generation via Insertion Operations authored by Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit.

Mitchell Stern gave an excellent presentation on his use of insertion transformers for sequence generation (German translation in his case). Transformer models were a hot topic at the conference, and this particular talk gave a great overview of why transformers offer improvements on certain tasks. The speaker also presented great examples of how the Insertion Transformer works at a high level and compared it with alternative approaches (e.g., autoregressive predictors).

He began his presentation with basic examples of the standard autoregressive predictions made by most sequence models, and why they can be limiting: generation proceeds strictly left to right, one token per pass, so an n-token output requires n passes. Stern used the example “three friends ate lunch together”. You would begin with an empty list, and after the first iteration you would have a list with one element: “three”. The next pass would append “friends” to the list, and then “ate”, “lunch”, and finally “together”.
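
The left-to-right process above can be sketched as a simple loop (a toy illustration with a stand-in “model”, not the actual Transformer):

```python
# Autoregressive decoding appends exactly one token per pass,
# so an n-token sentence takes n passes to produce.
def autoregressive_decode(predict_next, max_steps=10):
    output = []
    for _ in range(max_steps):
        token = predict_next(output)  # next token given the current prefix
        if token == "<eos>":
            break
        output.append(token)
    return output

# Stand-in "model" that always continues the example sentence.
sentence = ["three", "friends", "ate", "lunch", "together"]
predict_next = lambda prefix: sentence[len(prefix)] if len(prefix) < len(sentence) else "<eos>"

result = autoregressive_decode(predict_next)
# Intermediate states: [] -> ["three"] -> ["three", "friends"] -> ... -> the full sentence
```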

Non-autoregressive models have been presented in previous work (Gu et al., 2017; Lee et al., 2018), but they produced lower-quality outputs and required explicit length modeling. Insertion-based models aim to solve both of these deficiencies. The Insertion Transformer (the star of the presentation) evaluates the probability p(word | slot, context) at each pass. For the same example as above, it might append “ate” to the empty list after the first pass. Then, on the second pass, it might add “three” to the beginning of the list and “together” at the end (i.e., it can add multiple words in different slots of the list during each iteration). During the third and final pass it might fill in “friends” and “lunch”, resulting in the same list [“three”, “friends”, “ate”, “lunch”, “together”]. Note that during this final pass it also appends the end-of-sentence token to the end of the list.
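
The three-pass trace above can be replayed mechanically. The insertions here are hardcoded from the talk’s example (a real Insertion Transformer would predict them from p(word | slot, context)):

```python
# Apply a batch of (slot, word) insertions to a partial hypothesis.
# Inserting from the highest slot index down keeps earlier indices valid.
def apply_insertions(hypothesis, insertions):
    for slot, word in sorted(insertions, reverse=True):
        hypothesis.insert(slot, word)
    return hypothesis

hyp = []
hyp = apply_insertions(hyp, [(0, "ate")])                     # pass 1
hyp = apply_insertions(hyp, [(0, "three"), (1, "together")])  # pass 2
hyp = apply_insertions(hyp, [(1, "friends"), (2, "lunch")])   # pass 3
# hyp is now ['three', 'friends', 'ate', 'lunch', 'together'] in 3 passes
# instead of the 5 passes an autoregressive decoder would need.
```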

The model is trained to be able to complete any partial hypothesis (as opposed to computing the loss from the next generated token in the sequence, or a uniform loss). During training, a random subset size is sampled (k ~ Uniform({0, 1, 2, …, n})), and a random subset of k tokens is sampled to obtain a partial output. The loss is then computed for a single insertion step.
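
That sampling procedure can be sketched as follows (my paraphrase of the recipe above, not the authors’ code):

```python
import random

# Sample a training example: pick a random subset size k, then a random
# k-token subset of the target, preserving order, as the partial hypothesis.
def sample_partial_output(target, rng=random):
    n = len(target)
    k = rng.randint(0, n)                   # k ~ Uniform({0, 1, ..., n})
    kept = sorted(rng.sample(range(n), k))  # which target positions survive
    return [target[i] for i in kept]

target = ["three", "friends", "ate", "lunch", "together"]
partial = sample_partial_output(target)
# `partial` is an ordered subsequence of `target`; the single-step loss is
# then computed over the tokens still missing from it.
```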

The model uses a balanced binary tree loss, which I had never heard of before, but it essentially aims to give higher weight to words closer to the center of each slot. There are also two loss terms (an additional one for slot loss) (see slide photo).
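
My reading of the center-weighting idea is a softmax over the negative distance to the middle of each slot’s missing span, with a temperature tau. This is a hedged sketch of that interpretation, not the authors’ exact formulation:

```python
import math

# Weight each missing token in a span: tokens nearer the span's center get
# higher weight (softmax of negative distance to center, temperature tau).
# Hypothetical helper illustrating the idea behind the balanced binary tree loss.
def center_weights(span_len, tau=1.0):
    center = (span_len - 1) / 2.0
    scores = [-abs(i - center) / tau for i in range(span_len)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

w = center_weights(5)
# The middle token (index 2) gets the largest weight, encouraging the model
# to produce center tokens first and build a balanced tree of insertions.
```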

Inference can be done either sequentially or in parallel (like the example above). During sequential inference, decoding is done by picking the best content-location pair:
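
Roughly in the paper’s notation (reproduced from my notes, so treat the exact symbols as an approximation), with x the source sentence and ŷ the current partial hypothesis, one sequential step picks the jointly most likely token and slot:

```latex
(\hat{c}, \hat{l}) = \operatorname*{argmax}_{c,\, l} \; p(c, l \mid x, \hat{y})
```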

On the other hand, during parallel inference, decoding is performed by choosing the best content for each location:
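
Again in approximately the paper’s notation, parallel decoding instead takes an argmax per slot, so every slot can receive an insertion in the same pass:

```latex
\hat{c}_l = \operatorname*{argmax}_{c} \; p(c \mid l, x, \hat{y}) \quad \text{for each slot } l
```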

He concluded his talk by going over the model’s inference results (using German translation as an example). The authors compare their results across different loss functions (binary tree, uniform, and left-to-right sequential), different model architectures, and earlier studies that don’t use the Insertion Transformer. They found that the binary tree loss with parallel inference gave the best results, though these weren’t drastically different from the blockwise parallel method of Stern et al. (2018). Nonetheless, the results are promising for any application that requires a sequence-to-sequence model.

Overall, the presentation was very interesting, especially how they applied non-autoregressive methods to sequence prediction (I had not seen that done before). These methods could be applied to the general chatbot recently developed at Panoramic. The current chatbot uses a vanilla sequence-to-sequence model; however, an upgrade to transformer layers (or the Insertion Transformer in this case) could potentially make a substantial improvement to the model’s question-answer responses.