Building a chatbot to answer data-related questions

The Panoramic AI Research team has developed a simple chatbot to interface with clients. The main purpose of the chatbot is to provide customers with simple top-level numbers that might get lost in some of the charts and tables within our storyboards. The chatbot consists of three components: 

  • a (true-false) classifier that determines if the user’s question is queryable from our database (true) or not (false). 
  • a SQL query builder based on the user’s input, if the input is queryable (i.e., if the answer to the first component is true).
  • a general chatbot designed to have a simple back-and-forth conversation with the user if the input is not queryable. 

These three components are joined to allow the user to both ask specific questions about their data within our system and to converse with the chatbot (Figure 1).

Figure 1. General schematic of the three components of the chatbot:
1) Text Binary Classifier, 2) SQL Query Builder, and 3) General Chatbot.

The first component is a binary text classifier that determines whether the user’s input is a question that can be converted to a SQL query. The output of the classifier is either true or false. The classifier is a deep neural network that consists of two LSTM (Long Short-Term Memory) layers and one fully-connected layer with one output unit (Figure 2). The input text is first converted to a matrix of ones (1) and zeros (0) based on the sequence of letters in the input, with one row per character. 
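This character-level encoding step can be sketched as follows; the alphabet and padding length below are illustrative assumptions, not the production values:

```python
# Hypothetical character alphabet; the real one used in production
# may differ (e.g., it may include punctuation or uppercase letters).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 ?"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(text, max_len=80):
    """Convert text to a max_len x len(ALPHABET) matrix of 0s and 1s,
    one row per character (rows past the end of the text stay all-zero)."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for row, ch in enumerate(text.lower()[:max_len]):
        col = CHAR_INDEX.get(ch)
        if col is not None:  # unknown characters stay all-zero
            matrix[row][col] = 1
    return matrix

m = encode("What was the CPM?")
```

Each row of the resulting matrix has at most a single 1, marking which character appears at that position in the sequence.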

This matrix then gets passed through the neural network, starting with the LSTM layers. The LSTM layers are ideal for this specific task as they were designed to have built-in memory for sequential data like natural language text. The network ends with a fully-connected layer, with one output that is either 1 for true or 0 for false.  
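The forward pass through this architecture (two stacked LSTM layers of 30 units, then a fully-connected layer with one sigmoid output) can be illustrated with a minimal NumPy sketch. The weights here are random placeholders, so the output is meaningless; the point is only to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(xs, units):
    """Run a minimal LSTM over a sequence of input vectors."""
    d = xs.shape[1]
    # One randomly initialized weight matrix per gate:
    # input, forget, cell, output.
    W = rng.normal(0, 0.1, (4, units, d + units))
    h = np.zeros(units)
    c = np.zeros(units)
    outputs = []
    for x in xs:
        z = np.concatenate([x, h])
        i = sigmoid(W[0] @ z)      # input gate
        f = sigmoid(W[1] @ z)      # forget gate
        g = np.tanh(W[2] @ z)      # candidate cell state
        o = sigmoid(W[3] @ z)      # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.array(outputs)

def classify(matrix):
    """matrix: one-hot encoded input, shape (seq_len, alphabet_size)."""
    h1 = lstm_layer(matrix, 30)     # first LSTM layer
    h2 = lstm_layer(h1, 30)         # second LSTM layer
    w_out = rng.normal(0, 0.1, 30)  # fully-connected layer, 1 unit
    score = sigmoid(w_out @ h2[-1]) # probability the input is queryable
    return score >= 0.5             # threshold to true/false

result = classify(np.zeros((20, 38)))  # dummy one-hot input
```

In practice this network would be built and trained with a deep-learning framework rather than hand-rolled; the sketch just mirrors the layer structure in Figure 2.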

Figure 2. Schematic diagram of the binary text classifier.
The LSTM layers consist of 30 blocks each, and the output of the fully-connected layer is one block that determines the true-false classification.

This network was trained on artificially generated queries that follow certain templates (e.g., What was the CPM for Logan Lucky in the last month?); these artificial queries have a label of true. The network was also trained on random comments from Reddit, which were labeled false. Half of the training data comes from the artificially generated queries and the other half from the Reddit comments. A total of 60,000 training examples were created, and training iterates over the examples 20 times (epochs) to converge. Once the text classifier is trained, it determines whether the user input is queryable. 
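The template expansion for the "queryable" class can be sketched like this; the specific templates, metrics, and account names below are illustrative assumptions:

```python
import random

# Hypothetical templates and fill-in values; the production set is larger.
TEMPLATES = [
    "What was the {metric} for {account} in the last {timeframe}?",
    "How did the {metric} for {account} change over the last {timeframe}?",
]
METRICS = ["CPM", "CPC", "impressions", "spend"]
ACCOUNTS = ["Logan Lucky", "Acme Films"]
TIMEFRAMES = ["week", "month", "quarter"]

def make_positive_examples(n, seed=42):
    """Generate n artificial queries, each labeled true (queryable)."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        text = rng.choice(TEMPLATES).format(
            metric=rng.choice(METRICS),
            account=rng.choice(ACCOUNTS),
            timeframe=rng.choice(TIMEFRAMES),
        )
        examples.append((text, True))
    return examples

# 30,000 positives like these would be paired with 30,000 Reddit
# comments labeled False to form the 60,000-example training set.
```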

If the user input is queryable (true), the second component, a “rule-based” query builder, is utilized. This query builder extracts three pieces of information from the query: the account, the metric, and the timeframe. The algorithm scans the query for certain predefined values and makes a best guess as to what the value of each of these three variables should be. These values are then converted to a SQL query, and the results are put into a simple sentence structure and returned to the user. This simple text-to-query algorithm is limited to only a few types of questions, and we plan to build out this component more in the coming weeks. 
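The extraction step can be sketched as a dictionary lookup over the question text; the table and column names below are placeholders, not our real schema:

```python
# Hypothetical lookup tables mapping surface text to database values.
ACCOUNTS = {"logan lucky": "logan_lucky", "acme films": "acme_films"}
METRICS = {"cpm": "cpm", "cpc": "cpc", "impressions": "impressions"}
TIMEFRAMES = {"last week": 7, "last month": 30, "last quarter": 90}

def build_query(question):
    """Scan the question for predefined account, metric, and timeframe
    values, and assemble a SQL string if all three are found."""
    q = question.lower()
    account = next((v for k, v in ACCOUNTS.items() if k in q), None)
    metric = next((v for k, v in METRICS.items() if k in q), None)
    days = next((v for k, v in TIMEFRAMES.items() if k in q), None)
    if not (account and metric and days):
        return None  # could not extract all three pieces of information
    return (
        f"SELECT AVG({metric}) FROM metrics "
        f"WHERE account = '{account}' "
        f"AND date >= CURRENT_DATE - INTERVAL '{days} days'"
    )

print(build_query("What was the CPM for Logan Lucky in the last month?"))
```

A real implementation would also handle misspellings and synonyms, and would parameterize the SQL rather than interpolate strings.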

Figure 3. Examples of input-output of questions that query the database.

The third component is a general chatbot, which is used when the first component returns false. The general chatbot uses a sequence-to-sequence deep learning model that generates output text from input text; such models are commonly used in chatbots and language translation systems. 

The sequence-to-sequence model has an encoder (consisting of one LSTM layer), a latent layer (that stores semantic meaning), and a decoder (which converts the latent layer into an output via another LSTM layer and a fully connected layer). Similar to the binary text classifier, the first step is to convert the input text to a matrix of ones (1) and zeroes (0) that represent the sequence of letters. The output also goes through a similar conversion during training and within the chatbot. 
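The encoder/latent-layer/decoder flow can be illustrated with a minimal NumPy sketch. As with the classifier sketch above, the weights are random placeholders and the layer sizes are assumptions; the point is the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_run(xs, units, h=None, c=None):
    """Minimal LSTM over a sequence; returns outputs and final state."""
    d = xs.shape[1]
    W = rng.normal(0, 0.1, (4, units, d + units))  # one matrix per gate
    h = np.zeros(units) if h is None else h
    c = np.zeros(units) if c is None else c
    outs = []
    for x in xs:
        z = np.concatenate([x, h])
        i, f, o = sigmoid(W[0] @ z), sigmoid(W[1] @ z), sigmoid(W[3] @ z)
        g = np.tanh(W[2] @ z)
        c = f * c + i * g
        h = o * np.tanh(c)
        outs.append(h)
    return np.array(outs), h, c

def respond(encoded_input, alphabet_size=38, units=64, out_len=10):
    # Encoder: only the final (h, c) state is kept -- the latent layer
    # that stores the semantic meaning of the input.
    _, h, c = lstm_run(encoded_input, units)
    # Decoder: start from the latent state. Here we feed zeros at each
    # step; a real decoder would feed back its previous output character.
    dec_in = np.zeros((out_len, alphabet_size))
    dec_out, _, _ = lstm_run(dec_in, units, h, c)
    # Fully-connected layer: one score per character at each output step.
    W_fc = rng.normal(0, 0.1, (alphabet_size, units))
    logits = dec_out @ W_fc.T
    return logits.argmax(axis=1)  # predicted character indices

reply = respond(np.zeros((15, 38)))  # dummy one-hot encoded input
```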

Figure 4. Examples of question-responses from the general chatbot.

This model was trained on over 100,000 input-response pairs from movie scripts, using the Cornell movie script dataset. Similar datasets were also constructed from Reddit and Facebook comment-response pairs, but we found that the movie script training yielded the best results. The sequence-to-sequence model iterates through the examples 200 times during training. After training, the model takes input text and generates a response based on the movie lines it was trained on. It should be noted that the resulting general chatbot handles basic back-and-forth question-and-answer reasonably well, but it is not perfect, and strange answers have already been discovered. 

The text binary classifier, the SQL generator, and the general chatbot comprise the three components of our newly constructed chatbot (currently called skybot). Improvements planned for the coming weeks include expanded capabilities for the SQL generator (e.g., querying more platforms and supporting more complex calculations via Query Builder). We also plan to improve the general chatbot with more complex layers (e.g., Transformer networks, or layers that combine attention mechanisms with bidirectional encoders). 

The chatbot can be found here.