Building a chatbot initially seems quite simple. The internet is full of examples, and it’s easy to put something together. In the real (i.e. production) world, life is more complicated.
Computer programmers make assumptions about user behavior. We establish the invariants of the system and enforce them through UIs, validation rules and error handling. Life is predictable. Unfortunately chatbots, like all bots, should behave like humans, not computer systems.
We expect chatbots to handle whatever is thrown at them. In this post I’m going to cover the things we need to think about when building a production-grade chatbot. In particular, I’m going to explain why we need to build adaptive chatbot dialogs.
Intents and dialog flows
Most chatbots rely on pre-scripted dialog flows, built to meet a specific goal. Let’s take a simple example:
Bot: How can I help?
User: I need a duplicate statement
Bot: Which year?
Bot: Which month?
This simple example can be served using the classic intent-to-dialog mapping. In this scenario, the initial phrase “I need a duplicate statement” triggers a pre-scripted dialog. The dialog has two turns/slots, prompting for the year and the month.
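The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `DIALOGS` and `TRIGGERS` tables and the function names are hypothetical, and the substring match stands in for a real intent classifier.

```python
from typing import Dict, List, Optional

# Hypothetical intent -> scripted prompts table (names are illustrative).
DIALOGS: Dict[str, List[str]] = {
    "duplicate_statement": ["Which year?", "Which month?"],
}

# Hypothetical trigger phrases; a real bot would use an intent classifier.
TRIGGERS: Dict[str, str] = {
    "duplicate statement": "duplicate_statement",
}


def detect_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose trigger phrase appears in the text."""
    text = utterance.lower()
    for phrase, intent in TRIGGERS.items():
        if phrase in text:
            return intent
    return None


def scripted_prompts(utterance: str) -> List[str]:
    """Return every prompt of the matched dialog, in script order."""
    intent = detect_intent(utterance)
    if intent is None:
        return []
    return list(DIALOGS[intent])
```

Note that `scripted_prompts` always returns the full script, whatever the user has already said. That rigidity is the weakness the rest of this post deals with.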
Dynamic slot filling
Even this example is not quite as simple as it seems. What happens if the conversation goes like this:
Bot: How can I help?
User: I need a duplicate statement for May 2020
Bot: Which year?
User: I just told you
If we now ask the user which month and year we look stupid. The dialog flow needs to adapt. The most common solution to this is to use a combination of Named Entity Recognition and slot filling. In this case we need to fill two “slots” - month and year. First we map the slots to prompts:
year ➞ what year?
month ➞ which month?
Next we perform Named Entity Recognition (NER) on the user’s text, trying to fill both slots. In the example above, “duplicate statement” triggers the dialog and NER fills the year and month slots. “what year?” and “which month?” will not be asked.
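Here is a rough sketch of that slot-filling step. It assumes a regular expression is good enough to stand in for a trained NER model, which it is not in practice (for example, it would mistake the verb “may” for the month). The `extract_slots` and `remaining_prompts` names are made up for this illustration.

```python
import re
from typing import Dict, List, Optional

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]

# Slot -> prompt mapping, as listed above.
PROMPTS: Dict[str, str] = {"year": "what year?", "month": "which month?"}


def extract_slots(utterance: str) -> Dict[str, Optional[str]]:
    """Regex stand-in for NER: pull a four-digit year and a month name."""
    text = utterance.lower()
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    month = next((m for m in MONTHS if re.search(rf"\b{m}\b", text)), None)
    return {"year": year.group(0) if year else None, "month": month}


def remaining_prompts(slots: Dict[str, Optional[str]]) -> List[str]:
    """Only prompt for slots that are still empty."""
    return [PROMPTS[name] for name in PROMPTS if slots.get(name) is None]
```

For “I need a duplicate statement for May 2020” both slots are filled from the first message, so `remaining_prompts` returns an empty list and the bot can move straight on.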
Part of Speech Tagging & Dependency Parsing
So far so good. What happens if the conversation goes like this:
Bot: How can I help?
User: I need a duplicate statement for last month
This becomes more difficult. Named Entity Recognition is unlikely to work for “last month”. We could explicitly train the NER model for this scenario, but it’s likely to be brittle. We need another approach. One approach is to first attempt NER and if this fails, fall back to Part of Speech tagging and Dependency Parsing.
Part of Speech (POS) tagging tells us that “statement” and “month” are nouns, and that “duplicate” and “last” are adjectives. Dependency parsing tells us that “duplicate” modifies “statement”, “last” modifies “month”, and “month” is the object of the preposition “for”, which attaches to “statement”. If we put all this together we can understand the user’s intent: a duplicate statement covering the month before the current one.
Part of Speech tagging and Dependency Parsing are much more complex than simple NER, but they are also a lot more powerful.
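The “try NER first, then fall back” chain can be sketched as follows. To keep the example self-contained, both steps are regex stand-ins: `explicit_period` plays the role of NER, and `relative_period` plays the role of the parser by looking for a modifier attached to “month”. A real system would read that relation from the dependency tree instead. All function names here are hypothetical.

```python
import re
from datetime import date
from typing import Optional, Tuple

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]

Period = Tuple[int, int]  # (year, month number)


def explicit_period(text: str) -> Optional[Period]:
    """First attempt (NER stand-in): an explicit mention such as 'May 2020'."""
    pattern = r"\b(" + "|".join(MONTHS) + r")\s+((?:19|20)\d{2})\b"
    match = re.search(pattern, text.lower())
    if match is None:
        return None
    return int(match.group(2)), MONTHS.index(match.group(1)) + 1


def relative_period(text: str, today: date) -> Optional[Period]:
    """Fallback (parser stand-in): a modifier on 'month', e.g. 'last month'."""
    match = re.search(r"\b(last|previous|this)\s+month\b", text.lower())
    if match is None:
        return None
    offset = 0 if match.group(1) == "this" else 1
    # Count months from year 0 so that January rolls back to December.
    index = today.year * 12 + (today.month - 1) - offset
    return index // 12, index % 12 + 1


def resolve_period(text: str, today: date) -> Optional[Period]:
    """Try the explicit extractor first, then the relative fallback."""
    return explicit_period(text) or relative_period(text, today)
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the resolver easy to test around year boundaries.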
Capturing the right data
Let’s take another example, this time retail. Imagine we run a clothing store, and we want to recommend products to our customers. First we ask the user some questions to understand their wants. For this example we will assume we need to fill seven “slots”, including product, colour, size and price.
We can use a combination of Named Entity Recognition, Part of Speech tagging and Dependency Parsing to fill these slots. Hopefully the system is smart enough to adapt, filling more than one slot at a time.
“I want a black dress in a size 8”
In the above example we don’t need to ask for the product, colour or size, but we still need to ask 4 additional questions.
The risk of capturing too much information
Do we really need to capture 7 pieces of information before displaying some results? We run two risks:
- We may force the user to be so specific that we can’t actually find any matches in our database.
- The dialog may be so long that the user gets bored and gives up.
The risk of capturing too little information
Maybe we decide to focus on the attributes/slots that we absolutely need. Perhaps product type, price and size. After all, there’s not much point offering someone something they can’t afford or won’t fit.
However, we have to ask ourselves what value are we adding? A simple faceted search on the e-commerce site would achieve the same results. We also run the risk of not fully understanding the need. For example, if the user asks for a dress costing less than £500, we may find hundreds of matches.
Striking the balance
Ideally we want to achieve three goals:
- Meet the user’s needs
- Offer real value, beyond that which is achievable through other means (e.g. a website)
- Stimulate and retain the user’s interest (as in the AIDA marketing model: Attention, Interest, Desire, Action)
The first and second goals could be achieved using a rules-based approach with short-circuiting. We ask the user questions, whilst checking our database for matches. When we are able to offer “enough” results we stop asking questions and move on to displaying the results.
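A minimal sketch of the short-circuiting idea is shown below. The in-memory `CATALOGUE` is a hypothetical stand-in for the product database, and `max_results` is one possible reading of “enough”: we stop asking once the shortlist is small enough to show.

```python
from typing import Callable, Dict, List, Tuple

Product = Dict[str, str]

# Hypothetical catalogue; a real bot would query the product database.
CATALOGUE: List[Product] = [
    {"product": "dress", "colour": "black", "size": "8"},
    {"product": "dress", "colour": "black", "size": "10"},
    {"product": "dress", "colour": "red", "size": "8"},
    {"product": "shirt", "colour": "black", "size": "8"},
]


def matches(filled: Dict[str, str]) -> List[Product]:
    """Return the catalogue items consistent with every filled slot."""
    return [
        item for item in CATALOGUE
        if all(item.get(slot) == value for slot, value in filled.items())
    ]


def run_dialog(
    filled: Dict[str, str],
    slot_order: List[str],
    ask: Callable[[str], str],
    max_results: int = 2,
) -> Tuple[List[Product], List[str]]:
    """Ask for unfilled slots until the shortlist is small enough to show."""
    filled = dict(filled)  # do not mutate the caller's dict
    asked: List[str] = []
    for slot in slot_order:
        if len(matches(filled)) <= max_results:
            break  # short-circuit: we already have "enough" to display
        if slot not in filled:
            filled[slot] = ask(slot)
            asked.append(slot)
    return matches(filled), asked
```

Starting from `{"product": "dress"}` with `max_results=2`, the dialog asks only for the colour: once the user answers “black”, two dresses remain and the size question is skipped.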
The third goal is not so easy to achieve. We could also use a rules-based approach, maybe limiting the attributes to 3 for product one, 4 for product two etc. This is guesswork though. In reality the user’s attention span will be dictated by many factors including:
- the intent / need
- the time of the day
- the device used
- new vs repeat/loyal customer
Using machine learning to drive the dialog flow
Machine learning can help us here. During a “training” period we build the dialogs dynamically, trying different permutations of slots, like split A/B testing on steroids. We record everything, including the time of day, drop-off rate, conversions etc. This behavioural data can be used to build a machine learning model. This could be a simple regression model or something more sophisticated like a decision tree or ensemble model.
When we have a good model, we can plug it into our dialogs. We feed the same variables (product, device, time etc) into the model and ask it to predict which slots should be filled. This can of course be supplemented by a rules-based algorithm. The end result is a dynamic dialog flow, chosen on the evidence of observed user behaviour rather than guesswork.
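To make the idea concrete, here is a deliberately simple sketch. It is not a regression or tree model: it just tabulates the conversion rate for each (device, slot set) pair seen during the training period and picks the best-performing slot set for a given device. A real model would generalise across contexts it has not seen. The session format and function names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional, Tuple

# One recorded dialog: (device, slots we asked for, did the user convert?)
Session = Tuple[str, FrozenSet[str], bool]
Key = Tuple[str, FrozenSet[str]]


def conversion_rates(sessions: List[Session]) -> Dict[Key, float]:
    """Conversion rate for every (device, slot set) pair in the logs."""
    counts: Dict[Key, List[int]] = defaultdict(lambda: [0, 0])
    for device, slots, converted in sessions:
        counts[(device, slots)][0] += int(converted)  # conversions
        counts[(device, slots)][1] += 1               # total sessions
    return {key: wins / total for key, (wins, total) in counts.items()}


def best_slots(rates: Dict[Key, float], device: str) -> Optional[FrozenSet[str]]:
    """Pick the slot set with the highest observed rate for this device."""
    candidates = {
        slots: rate for (dev, slots), rate in rates.items() if dev == device
    }
    if not candidates:
        return None  # no data for this context: fall back to rules
    return max(candidates, key=lambda slots: candidates[slots])
```

With only a handful of sessions per combination the rates are very noisy, so in practice each permutation needs enough traffic before its numbers can be trusted.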
We used retail e-commerce as an example, but this concept can be applied to any domain. Going back to our original banking example, we could ask the user whether they want a paper or electronic statement, whether they want a certified copy, and so on. A machine learning model could predict which questions to ask to get the best results.
At a minimum our dialog flow should be smart enough to avoid asking redundant questions. We do this by filling multiple slots from each user response. We only prompt users for unfilled slots. Named Entity Recognition may not be enough. We may also need to employ Part of Speech tagging and Dependency Parsing for more complex concepts.
Finally, we need to think carefully about the information we capture from our users. If we ask for too little, we may be unable to offer any value. If we ask for too much we may lose their interest or struggle to return a result. We can use machine learning models to build dialog flows dynamically, delivering the best results for each individual user.