Chatbot best practices | KPIs, NLP training, validation & more

Anyone can build a chatbot. Most chatbot libraries have reasonable documentation, and the ubiquitous “hello world” bot is simple to develop. As with most things though, building an enterprise grade chatbot is far from trivial. In this post I’m going to share with you 10 tips we’ve learned through our own experience. This is not a post about Google Dialogflow, Rasa or any specific chatbot framework. It’s about the application of technology, the development process and measuring success. As such it’s most suitable for product owners, architects and project managers who are tasked with implementing a chatbot.

1. Know (and measure) your KPIs

I’m stating the obvious here, but it’s really important to know what you want to achieve, and how this can be measured. Many of your KPIs will be sector or domain specific, but I will give you some chatbot specific KPIs to think about. Listed in order of importance:

User feedback - Ask your users if they are satisfied. This can be done during the dialog flow or at the end. Asking for feedback will give you two KPIs:
1. Feedback rate - the percentage of users who provided feedback.
2. Satisfaction rate - the percentage of those users who were satisfied. I generally recommend offering a binary choice “satisfied” vs “not satisfied” instead of a rating from 1 to 5, but the choice is yours.
Bounce rate - The percentage of users who abandon conversations midway.
Self service rate - The percentage of users who were able to achieve their goal without needing to chat, phone or email a human agent.
Return rate - The percentage of users who return to use the chatbot again.
Hourly usage distribution - When are people using your bot? A bot that serves customers 24/7 is especially valuable as it’s usually prohibitively expensive to provide human cover 24/7.
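
As a rough illustration, most of the business KPIs above could be computed from logged conversations along these lines. This is a minimal sketch: the Conversation record and its field names are assumptions made for the example, not a prescribed schema, and the self service rate is measured per conversation rather than per user to keep things short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    """Hypothetical summary of one logged conversation."""
    user_id: str
    completed: bool            # False if the user abandoned midway
    escalated: bool            # True if a human agent had to step in
    satisfied: Optional[bool]  # None if the user gave no feedback


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when there is no data."""
    return 100.0 * part / whole if whole else 0.0


def business_kpis(conversations: list[Conversation]) -> dict[str, float]:
    total = len(conversations)
    with_feedback = [c for c in conversations if c.satisfied is not None]
    satisfied = sum(1 for c in with_feedback if c.satisfied)
    bounced = sum(1 for c in conversations if not c.completed)
    self_served = sum(1 for c in conversations if c.completed and not c.escalated)

    # Count conversations per user to find returning users.
    visits: dict[str, int] = {}
    for c in conversations:
        visits[c.user_id] = visits.get(c.user_id, 0) + 1
    returning = sum(1 for n in visits.values() if n > 1)

    return {
        "feedback_rate": percentage(len(with_feedback), total),
        "satisfaction_rate": percentage(satisfied, len(with_feedback)),
        "bounce_rate": percentage(bounced, total),
        "self_service_rate": percentage(self_served, total),
        "return_rate": percentage(returning, len(visits)),
    }


sample = [
    Conversation("u1", completed=True, escalated=False, satisfied=True),
    Conversation("u1", completed=True, escalated=True, satisfied=None),
    Conversation("u2", completed=False, escalated=False, satisfied=None),
    Conversation("u3", completed=True, escalated=False, satisfied=False),
]
kpis = business_kpis(sample)
```

Note that the satisfaction rate is calculated only over users who gave feedback, which is why it should always be read alongside the feedback rate.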

There are also a couple of technical KPIs to think about:

Model accuracy - If you are using natural language understanding to understand messages you will need to measure the accuracy of your models. This is a complex topic and is best left to data scientists and experienced developers. One word of caution - don’t get too hung up on the technical measures of accuracy. Ultimately it’s the business KPIs that matter. A model that is 98% accurate is no use if users are dissatisfied with the service.
Response times - Speed is not so important for a chatbot. You can work around slow performance by adding a “bot is typing” type message to dialogs. Sometimes we even fake a delay to make the bot seem more human. Nevertheless, you want to keep an eye on performance to maintain acceptable response times.

Finally, it’s important to know which channel your users favour if you deploy an omni-channel chatbot. Being dependent on one third party channel, e.g. Facebook Messenger, puts you in a vulnerable position.

2. Use human agents first

Like most technology, a “bot” is designed to automate tasks that would otherwise be done by a human operator. Before embarking on a chatbot it’s essential that you know exactly what you are trying to automate. The best way of doing this is to first employ human agents to respond to your users’ messages. Why do this?

To validate assumptions - you (or your product owner) may have established your use cases or user stories, but how valid are they? How many users actually choose live chat to check their order status?
The 80/20 rule - users will make all sorts of requests. Even for a particular use case, some requests will be so esoteric that it makes little sense to automate them. For example - you will most likely want to handle postage and delivery type queries, but does your bot really need to handle queries about VAT on BFPO shipments? Probably not.
To understand the tone of conversations - How do your users interact over live chat? Are they friendly or professional? Do they use colloquialisms? Do they use emojis and text speak? Understanding the tone of conversations will allow you to develop a chatbot that your users are comfortable using.
To get training data - This is probably the most important reason why you should use humans first. Even if you’re not planning to use natural language understanding you will still need data for keyword matching. If you are using NLU/NLP you will certainly need good training data. From what we’ve seen, 80% of chatbots fail to meet expectations because they used synthetic training data, or a limited sample of real world data.

You may already employ human agents to serve your customers. If so, you probably need to tweak the data you log, and the way it’s structured (see below). If you don’t yet employ human agents you can actually do this on a (relatively) small scale. You don’t need to serve all your customers manually before switching to a chatbot. What you’re after is a representative sample. For example, you may display a “live chat now” button for one in 10 visitors.
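
One way to show the button to roughly one in 10 visitors is to hash a visitor identifier into buckets, so that the same visitor gets the same experience on every page view. The sketch below assumes you already have some stable, non-personal visitor ID (e.g. a random cookie value); the function name is made up for this example.

```python
import hashlib


def show_live_chat(visitor_id: str, one_in: int = 10) -> bool:
    """Return True for roughly one in `one_in` visitors, stably per visitor."""
    if one_in < 1:
        raise ValueError("one_in must be at least 1")
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer and bucket it.
    bucket = int.from_bytes(digest[:8], "big") % one_in
    return bucket == 0
```

Hashing rather than calling a random number generator on each page view avoids the button appearing and disappearing for the same visitor.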

3. Log (almost) everything

Ok, you need to be mindful of GDPR, so you can’t log everything. For your purposes you don’t actually need personally identifiable information. What you’re after is the phrases users use. In particular, you’re interested in:

intents - "i want to check my order status" - You can map intents to use cases/user stories.
entities - "yes, a jacket" - Entities (relevant nouns) form the basis of named entity recognition.
parts of speech - "I want a black or white dress" - The adjectives, prepositions and conjunctions. You will use these to train your part of speech tagging models.
sentiment - "I’m not happy" or "thanks for your help" - You can use text classification and sentiment analysis to detect when users are satisfied or dissatisfied.

Ideally you will log conversations in a freeform data store; something like Elasticsearch would be great. It’s important to log not only the messages but the wider context, i.e. you want to tie messages together into conversation threads and identify the participants (user vs agent). Log the conversations during the initial human pilot phase and also during the full implementation. You’ll want to continually evaluate, refine and improve.

You should also log key events such as clicks and conversions, along with metadata such as timestamps, device used, IP address etc. Just be careful that you don’t breach GDPR via jigsaw identification, where individually harmless fields combine to identify a person.
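
To make this concrete, here is a minimal sketch of a log record that ties each message to a conversation and a participant, with a crude redaction pass before anything is stored. The field names and the two regular expressions are assumptions for illustration only; pattern matching like this will not catch all personal data and is no substitute for a proper GDPR review.

```python
import json
import re
from datetime import datetime, timezone

# Crude patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER = re.compile(r"\b\d{9,}\b")  # phone or card-like runs of digits


def redact(text: str) -> str:
    """Replace obvious email addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[email]", text)
    return LONG_NUMBER.sub("[number]", text)


def log_record(conversation_id: str, sender: str, text: str) -> str:
    """Build one JSON log line for a single message in a conversation."""
    if sender not in ("user", "agent", "bot"):
        raise ValueError(f"unknown sender: {sender}")
    record = {
        "conversation_id": conversation_id,  # ties messages into a thread
        "sender": sender,                    # identifies the participant
        "text": redact(text),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)
```

Each line is plain JSON, so it can be indexed into a document store or simply appended to a file during a small pilot.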

4. Make the most of your training data

If you’ve followed the advice above, you should have some decent training data. Now it’s time to put it to use.

You may want to split the conversations into 3 parts:

intro - the first couple of messages e.g. "I want to check my order status". Useful for intent and entity analysis.
body - the messages in between e.g. "last monday". These may contain additional entities.
sign-off - the last couple of messages e.g. "thanks for your help". Useful for sentiment analysis.
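
The three-way split can be sketched as a simple slicing function. Here "a couple" is taken to mean two messages, which is an assumption you will want to tune against your own data.

```python
def split_conversation(user_messages: list[str], edge: int = 2) -> dict[str, list[str]]:
    """Split a user's messages into intro, body and sign-off.

    `edge` is how many messages count as "a couple" at each end.
    Short conversations fill the intro first, then the sign-off.
    """
    if edge < 1:
        raise ValueError("edge must be at least 1")
    intro = user_messages[:edge]
    rest = user_messages[edge:]
    return {
        "intro": intro,
        "body": rest[:-edge],
        "sign_off": rest[-edge:],
    }


parts = split_conversation([
    "hi",
    "I want to check my order status",
    "order A123",
    "last monday",
    "great",
    "thanks for your help",
])
```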

Next, apply clustering on the “intro” messages to identify common intents and entities. Following the 80/20 rule this will tell you where to focus your efforts.

Finally, use the data to train and test your NLU models or keyword matching algorithms. Take care when splitting the training and test datasets.
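
One pitfall worth guarding against when splitting is leakage: if messages from the same conversation land in both the training and test sets, your accuracy figures will look better than they really are. A minimal sketch of a split that keeps each conversation on one side follows; the (conversation_id, text, intent) tuple layout is an assumption for this example.

```python
import random

Example = tuple[str, str, str]  # (conversation_id, text, intent)


def split_by_conversation(
    examples: list[Example], test_fraction: float = 0.2, seed: int = 42
) -> tuple[list[Example], list[Example]]:
    """Split examples so that no conversation appears in both sets."""
    conversation_ids = sorted({conv_id for conv_id, _, _ in examples})
    random.Random(seed).shuffle(conversation_ids)  # seeded for repeatability
    n_test = max(1, round(len(conversation_ids) * test_fraction))
    test_ids = set(conversation_ids[:n_test])
    train = [e for e in examples if e[0] not in test_ids]
    test = [e for e in examples if e[0] in test_ids]
    return train, test
```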

As mentioned in section 2, you may also want to analyse the data to understand the tone of the conversations. This will be useful when thinking how to word the questions your bot will ask.

5. Think about validation and error handling

Experienced IT professionals think carefully about validation and error handling when building apps or websites. You can usually rely on the UI to help enforce constraints, for example by using a dropdown select box with the valid options. The challenge arises when trying to enforce the same constraints in a chatbot.

Quick replies

Some channels offer “quick replies” - prefilled responses. Quick replies can be used as a means of constraining user behaviour, but should be used with care. Unlike dropdown boxes, the options are typically displayed horizontally or vertically and take up valuable screen real estate, especially on mobile devices. This makes them suitable for responses with only a few options.

In most cases you won’t be able to use quick replies. Even if they are a feasible option, a chatbot with lots of quick replies is nothing more than an app with a poor UI. As the name implies, quick replies should be used to help users respond quickly. They exist to make life easier for users, not developers.

Message validation

Free text entry is at the heart of a chatbot. It’s unconstrained, so good validation and error handling is especially important. Remember - whilst your NLU model may correctly identify an entity, this doesn’t mean your downstream systems can handle it. "100 pounds" or "last monday" are examples of entities that an NER model will probably recognise, but need transforming for downstream consumption.
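
As an illustration, the two entities mentioned above could be transformed into values a downstream system can consume roughly as follows. This is a sketch under stated assumptions: the function names are made up, amounts are assumed to be in pounds sterling, and "last monday" is interpreted as the most recent Monday strictly before today, which is only one of several reasonable readings.

```python
import re
from datetime import date, timedelta
from decimal import Decimal
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def parse_pounds(text: str) -> Optional[Decimal]:
    """Turn e.g. "100 pounds" into Decimal("100"); None if it doesn't match."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d{1,2})?)\s*pounds?\s*", text, re.IGNORECASE)
    return Decimal(match.group(1)) if match else None


def parse_last_weekday(text: str, today: date) -> Optional[date]:
    """Turn e.g. "last monday" into a concrete date before `today`."""
    match = re.fullmatch(r"\s*last\s+(\w+)\s*", text, re.IGNORECASE)
    if not match or match.group(1).lower() not in WEEKDAYS:
        return None
    target = WEEKDAYS.index(match.group(1).lower())
    # date.weekday() returns 0 for Monday; go back a full week if it is today.
    days_back = (today.weekday() - target) % 7 or 7
    return today - timedelta(days=days_back)
```

Returning None rather than raising lets the dialog ask the user to clarify instead of failing further down the line.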

6. Think how to handle more than one message

Here’s the typical chatbot flow:

Ask question
Process reply

The problem arises when your users don’t send a single message in reply to the question but several. Let’s take this simple example of a dialog between a customer and a human agent:

Agent: hello how can i help?
User: hi
User: i want to check my order status
User: order A123

Now let’s see how this might look with a naive bot:

Bot: hello how can i help?
User: hi
Bot: Sorry i dont understand
User: i want to check my order status
User: order A123

The bot asks the user a question, then tries to infer intent from the reply “hi”. It can’t make sense of this, so generates an error. There are a few workarounds:

Support chit chat

Basically you train the chatbot to recognise “chit chat” type messages, which it can either reply to or simply ignore. Taking the example above, the bot would either ignore the “hi” or reply with “hello”. Either way, it wouldn’t generate an error.

Buffer incoming messages

Buffer all incoming messages. Wait until N seconds have elapsed since the last message. At this point concatenate all the buffered messages together into a single message and process it. Taking the above example it would look like:

Bot: hello how can i help?
User: hi
User: i want to check my order status
User: order A123
(wait N seconds)
Bot: Ok …

If the channel allows, you may be able to monitor the “user is typing” notification instead, setting N to a lower value. The downside to this approach is that the user always has to wait N seconds for a response which makes the bot seem unresponsive.
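
A minimal sketch of the buffering approach is shown below. To keep it testable, the caller passes in the current time rather than the class starting real timers; in practice you would call flush_if_quiet from a scheduler or event loop. The class and method names are assumptions for this example.

```python
from typing import Optional


class MessageBuffer:
    """Collects messages until the user has been quiet for N seconds."""

    def __init__(self, quiet_seconds: float):
        self.quiet_seconds = quiet_seconds
        self._messages: list[str] = []
        self._last_received: Optional[float] = None

    def add(self, text: str, now: float) -> None:
        """Store an incoming message and restart the quiet period."""
        self._messages.append(text)
        self._last_received = now

    def flush_if_quiet(self, now: float) -> Optional[str]:
        """Return the concatenated messages once N seconds have passed."""
        if not self._messages or now - self._last_received < self.quiet_seconds:
            return None
        combined = " ".join(self._messages)
        self._messages.clear()
        self._last_received = None
        return combined


buffer = MessageBuffer(quiet_seconds=3)
buffer.add("hi", now=0)
buffer.add("i want to check my order status", now=1)
buffer.add("order A123", now=2)
too_early = buffer.flush_if_quiet(now=4)  # only 2 seconds of quiet so far
combined = buffer.flush_if_quiet(now=5)   # 3 seconds of quiet: process
```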

Buffer but short circuit

The same approach as described above, but instead of always waiting N seconds, you try to process the message buffer every time a message is received. If you can process it, you do so immediately, avoiding delay. Going back to the contrived dialog it would look something like:

Bot: hello how can i help?
User: hi
(can’t process - wait)
User: i want to check my order status
(bingo)
Bot: Ok what is your order number?
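
The short circuit variant might look something like this. The toy_intent function is a hypothetical stand-in for your NLU model or keyword matcher; all that matters is that it returns None when it can’t make sense of the text.

```python
from typing import Callable, Optional


def toy_intent(text: str) -> Optional[str]:
    """Hypothetical stand-in for an NLU model: an intent name or None."""
    return "check_order_status" if "order status" in text.lower() else None


class ShortCircuitBuffer:
    """Tries to process the whole buffer every time a message arrives."""

    def __init__(self, detect: Callable[[str], Optional[str]]):
        self._detect = detect
        self._messages: list[str] = []

    def receive(self, text: str) -> Optional[str]:
        """Add a message; return an intent as soon as one can be detected."""
        self._messages.append(text)
        intent = self._detect(" ".join(self._messages))
        if intent is not None:
            self._messages.clear()  # processed, so start a fresh buffer
        return intent


bot = ShortCircuitBuffer(toy_intent)
first = bot.receive("hi")                                # can't process - wait
second = bot.receive("i want to check my order status")  # bingo
```

You would still want the N second timeout from the previous approach as a backstop, otherwise a buffer that never becomes processable leaves the user without any reply.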

7. Use checkpoints

As well as validating each user response, you will want to set up various “checkpoints”. This means telling the user what the bot has understood and asking them to confirm this. For example saying something like:

"I understand you want to check the status of order number A123"

Of course, you need to think carefully about how you will handle a negative response. Simply repeating the same questions again and running the answers through the same NLU model or algorithm is unlikely to work. Many chatbots ask the user to rephrase their request in the hope that it will work second time around. We think this is a poor strategy - there’s no guarantee it will work, and it’s a poor user experience.

We believe there are two approaches that will yield better results.

Drill down

We call the first strategy the “drill down” approach. Start out by asking users open questions e.g. "how can I help?" or "what are you looking for?" . Run the responses through the NLU models and algorithms and checkpoint the conversation.

If all is ok, great! If not, you move on to ask more specific, closed questions - probably with some guidance. For example "do you have a query about a return?" . You will probably use a different set of NLU models or algorithms to handle answers to these closed questions.

You wouldn’t want to start out by asking this sort of question, because closed questions result in a lengthy dialog. It’s much better for a user to say "I want a white dress in size 12" than answering multiple questions about the product, colour and size. The aim here is to gracefully handle the outliers that can’t be served via the “happy path”.

Bailout

We call the second approach the “bailout”. Put simply, if you can’t understand the user’s needs you fall back to human intervention. See below for more details.
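
The two strategies can be chained: start open, drill down with closed questions after a failed checkpoint, and bail out once those are exhausted. A minimal sketch follows. The second closed question and the bailout wording are illustrative, and counting failed checkpoints is just one simple way to track where you are in the flow.

```python
OPEN_QUESTION = "How can I help?"
CLOSED_QUESTIONS = [
    "Do you have a query about a return?",
    "Do you have a query about a delivery?",  # illustrative addition
]
BAILOUT = "Let me put you in touch with one of our team."  # illustrative wording


def next_prompt(failed_checkpoints: int) -> str:
    """Choose what the bot says next, given how many checkpoints have failed."""
    if failed_checkpoints < 0:
        raise ValueError("failed_checkpoints cannot be negative")
    if failed_checkpoints == 0:
        return OPEN_QUESTION  # happy path: start with an open question
    if failed_checkpoints <= len(CLOSED_QUESTIONS):
        return CLOSED_QUESTIONS[failed_checkpoints - 1]  # drill down
    return BAILOUT  # nothing worked: hand over to a human
```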

8. Augment your chatbot with human agents

What can you do with the outliers? Firstly it’s important the system recognises when it’s failing to meet the user’s expectations. For your users, there’s nothing worse than talking to a brick wall. One way of detecting this is to count the number of “sorry I don’t understand” type responses generated for each dialog. As mentioned above, checkpointing is also very important.
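
Counting those responses needs very little code. The sketch below keeps one counter per dialog and signals when a human should take over; the threshold of two is an arbitrary assumption that you would tune against your bounce and satisfaction rates.

```python
class FallbackMonitor:
    """Counts "sorry I don't understand" style replies within one dialog."""

    def __init__(self, max_fallbacks: int = 2):
        if max_fallbacks < 1:
            raise ValueError("max_fallbacks must be at least 1")
        self.max_fallbacks = max_fallbacks
        self.fallbacks = 0

    def record_reply(self, understood: bool) -> bool:
        """Record one bot reply; return True when a human should take over."""
        if not understood:
            self.fallbacks += 1
        return self.fallbacks >= self.max_fallbacks
```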

You can’t expect your chatbot to be perfect, and it doesn’t have to be. There will be cases where the chatbot doesn’t understand the user due to an imperfect NLU model or algorithm. There will be instances where the bot simply lacks the business logic to fulfil the user’s request.

Providing a fallback or “bailout” to human agents is a great way of handling these edge cases. You’re not trying to create the perfect chatbot, even if such a thing were possible. You’re aiming to get the best return on your investment. These esoteric edge cases can be handled by a relatively small pool of human agents. What’s more, the conversations between the users and agents should be logged and will feed into your continuous improvement plan.

You don’t necessarily need to offer live chat style support either. It may be enough to ask the user to email your sales or customer service team with their request.

9. Adopt a continuous improvement plan

The reason you’re logging the conversations is to build up training data, allowing you to build accurate models. To borrow a cliché - this is a process not an event. Whilst the data captured during the initial “human” stage gets you started, you need to retrain the models as you collect more data.

You may discover that your users interact quite differently with your bot vs human agents. Decades of Googling have conditioned people into using a terse form of language, one intended to help the system understand their query. For example a user may tell a human agent "a white or cream cotton shirt" but tell the bot simply "cotton shirt white".

It’s also important to keep an eye on the KPIs and metrics.

10. Look for opportunities

Chatbots are freeform: users can say whatever they like. This presents challenges but also opportunities. Chatbots are great for market research.

Take for example, an e-commerce site for a clothing merchant. While viewing a dress the user can choose the colour: red or white. How can the user give feedback that they would like the same dress in black? They could fill out a feedback form or send an email, but they’re unlikely to do this. More likely they will look for another dress or choose another merchant.

In contrast, an e-commerce bot could ask "what colour?" to which the user will reply "black". The bot would need to tell the user that the dress is only available in red and white. However, it can suggest a similar dress that’s available in black. Crucially the bot has captured the demand for a black version of the dress. If enough users ask for black the buyers may decide it’s worth offering it next season.
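
Capturing that demand can be as simple as counting the requests you couldn’t fulfil. Here is a minimal sketch, assuming a hypothetical catalogue that maps each product to its available colours:

```python
from collections import Counter


class DemandTracker:
    """Records requests for variants that are not in the catalogue."""

    def __init__(self, catalogue: dict[str, set[str]]):
        self.catalogue = catalogue
        self.unmet: Counter = Counter()

    def request(self, product: str, colour: str) -> bool:
        """Return True if the colour is available; otherwise log the demand."""
        colour = colour.strip().lower()
        available = colour in self.catalogue.get(product, set())
        if not available:
            self.unmet[(product, colour)] += 1
        return available


tracker = DemandTracker({"dress": {"red", "white"}})
tracker.request("dress", "black")
tracker.request("dress", "Black")
tracker.request("dress", "Red")
top_unmet = tracker.unmet.most_common(1)
```

A report of the most common unmet requests is exactly the kind of thing your buyers can act on.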

Summary

Implementing an enterprise grade chatbot requires careful planning. It’s important to understand the KPIs and business drivers before embarking on the project. Having a means of measuring success is also really important.

Getting suitable training data is essential and one of the best ways of doing this is to use human agents first. Careful logging and monitoring will allow you to improve the accuracy of your chatbot over time. As with all software applications, validation and error handling is very important. Chatbots have the potential to misunderstand users, so checkpointing is a useful double check.

Be prepared to adapt and evolve quickly, especially during the early days. Use A/B testing to find the dialog flows that work best. Retrain the NLU models as you collect more training data. Look for opportunities - are users asking for use cases you’ve missed? Is there demand for a product or variant you’re not yet selling?

If you found this useful you might also be interested in an article about building robust chatbot dialogs.