

Apr 1, 2021 12:45:21 PM

How to Create a Voice Chatbot


Machine learning is advancing at an incredible pace these days, and it’s hard to keep up with everything. Luckily, we have you covered with a quick overview of the exciting technologies that will bring text-based chatbots into an omnichannel, voice-enabled world.


Despite what you may think, most of the magic in machine learning on audio actually happens via images and text. But where do those images and text come from? Today, we take a bit of a technical deep-dive into machine learning on audio.


First, an audio signal is a giant one-dimensional sequence of numbers. But what we hear depends on many of these data points together, and modern machine learning has become very adept at learning from two-dimensional data such as images. So, very often, the first step is to convert the audio signal into a spectrogram. Once we have the signal as a spectrogram, we can apply the same kinds of neural networks that work so well on images, and the magic of machine learning on audio can begin. In case you’re having a hard time visualizing this (I know I did), below is an image of me saying “Solvemate” as an audio signal and as a spectrogram.


[Image: the word “Solvemate” as an audio signal and as a spectrogram]
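To make the conversion concrete, here is a minimal sketch using NumPy and SciPy on a synthetic tone (a stand-in for a real recording; speech models typically use mel spectrograms from libraries like torchaudio or librosa, but the idea is the same):

```python
import numpy as np
from scipy.signal import spectrogram

sample_rate = 16000
# A synthetic 1-second, 440 Hz tone standing in for a real recording:
# raw audio is just a long one-dimensional sequence of amplitude values.
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)  # shape: (16000,)

# Convert the 1-D signal into a 2-D spectrogram (frequency x time), the
# image-like representation that convolutional networks handle so well.
freqs, times, spec = spectrogram(signal, fs=sample_rate, nperseg=512)
print(spec.shape)  # (frequency_bins, time_frames)
```

For this pure tone, the energy concentrates in the frequency bin closest to 440 Hz; a spoken word produces the richer patterns you can see in the image above.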


In order for us to move our chatbot to the voice world, we need to do two main tasks:

      1. Listen
      2. Speak


Automatic Speech Recognition (ASR)


In the world of machine learning, the task of “listening” to what someone is saying is called automatic speech recognition, or ASR for short. Even though it may seem easy, there are many things one has to take into account, such as background noise and audio quality, before you even get to the truly daunting problems like the variety of speakers, accents, diction, and so on.


But a lot of ML research and datasets are open source, so we can use state-of-the-art models such as NVIDIA’s QuartzNet to build models that convert speech to text in real time. We can then feed this text into our bot as normal input, using our existing NLP infrastructure.
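To give a flavour of what happens under the hood: CTC-based models like QuartzNet emit, for every spectrogram frame, a probability distribution over characters plus a special “blank” symbol, and a decoder turns those frames into text. Below is a toy greedy CTC decoder; the alphabet and frames are made up for illustration, and real systems use beam search and language models instead:

```python
import numpy as np

def greedy_ctc_decode(frame_probs, alphabet, blank=0):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    best = frame_probs.argmax(axis=1)
    chars, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)

# Toy example: a 3-symbol alphabet and 5 frames of one-hot "probabilities".
alphabet = ["<blank>", "h", "i"]
frames = np.eye(3)[[1, 1, 0, 2, 2]]
print(greedy_ctc_decode(frames, alphabet))  # -> "hi"
```

The blank symbol is what lets the model say “no new character in this frame”, so repeated letters and silence don’t get confused.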


Text-to-Speech (TTS)


Now that the end-users can tell the bot what they want with their voice, how about having the bot speak instead of presenting text to the user? That’s also possible, with something called text-to-speech, or TTS. Text-to-speech is essentially the opposite of ASR. Our models start with some words; these are converted to vectors, which act as the input to a model that creates spectrograms; finally, those spectrograms are converted back into the long 1-D audio signal that your speakers play. One of the state-of-the-art text-to-speech models is called Tacotron, and what it can do is pretty amazing. Here are some examples of a voice from one language speaking another (the “Accent Control” section, with an English-speaking voice speaking Spanish at different levels of an English accent, blew my mind).
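The very last step of that pipeline, turning a spectrogram back into a playable 1-D signal, can be sketched with SciPy’s STFT pair. (A real TTS vocoder has the harder job of estimating the phase that the model’s predicted spectrogram lacks; here we keep the full complex spectrogram, so the round trip is near-lossless.)

```python
import numpy as np
from scipy.signal import stft, istft

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
original = np.sin(2 * np.pi * 220.0 * t)  # a 1-second, 220 Hz tone

# Forward: 1-D audio -> complex spectrogram (frequency x time).
_, _, spec = stft(original, fs=sample_rate, nperseg=512)

# Inverse: spectrogram -> the long 1-D signal your speakers play.
_, reconstructed = istft(spec, fs=sample_rate, nperseg=512)
```

With the phase preserved, the reconstructed signal matches the original almost exactly; dropping the phase is what makes neural vocoders necessary.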

We can use such technologies to read out text in a way that feels natural and engages the end-user.


Voice at Solvemate


These technologies enable us to understand people and to answer them. That’s a basic requirement for a bot, but it’s not yet enough to create a chatbot experience that people will enjoy. Text-based chatbots are optimized for written communication, but humans communicate differently when they speak.

Read more: The Battle of the Bots: NLP versus Multiple-Choice

First of all, everyone has speech disfluencies: we use filler words like "um" and might not reply with a perfect sentence. On top of this, there are countless accents and dialects, and verbal communication is more descriptive and will never be as precise as writing. Secondly, there may be background noise accompanying the user's words, perhaps because their kids are playing in the same room or because they're out on the street. And third, once it's said, it's done - people can't go back and re-read the bot's answer, so they need to be sure they got all the information.


That's why voice bots need to be more precise, both in the information they provide and in the questions they ask, so that the user knows exactly what to answer. At the same time, voice bots need to be more fault-tolerant: they need to filter out background noise and to understand people’s sentences even when they are not perfectly expressed. They need to grasp the gist of what was said and follow up with good questions in case something isn't clear yet.


The key ingredient for a successful voice bot is the NLP engine that interprets the users’ expressions. That's why we designed the Solvemate Contextual Conversation Engine™, which is built around three key principles:

      1. Offering easy answer choices to set expectations
      2. Understanding a wide range of user expressions instead of just matching standard intents
      3. Being able to ask smart questions to really understand the user


Our vision is to create a truly omni-channel automation platform for customer service that allows brands to offer the best experience on text- and voice-based channels, while keeping the setup and maintenance efforts low.


Customer service shouldn't be a compromise, no matter how customers reach out to a company. And companies should provide a consistent service experience and share intelligence between channels in the same way as service agents handle emails, live chats and phone calls.


More Than IVRs


Voice bots will be more than the traditional IVRs (interactive voice response systems) that have been around for a long time. Typically associated with a lot of user frustration, traditional IVRs still don't provide much more value than putting customers off or routing them to the right service department.


The next generation of IVRs will be very different. Conversations will be much more fun and the communication will be more natural and faster. And they will be much more helpful. Just like a text-based chatbot, they will be able to 

    1. provide general information, e.g. about products a company is selling
    2. provide personalized information, e.g. about the delivery status of a customer’s last order
    3. route the customer to the right department


And as voice bots are part of a company’s omni-channel automation platform, it will be easy to hand over conversations between different channels. Let's say a customer wants to return an item and they call the hotline. The voice bot will ask them for the details over the phone and send them the return label via WhatsApp.


Doesn't this sound exciting? 


Text-based chatbots were just the first step of intelligent service automation. The goal is to provide the right service at the right time, no matter how customers reach out to a company.





David is a machine learning engineer at Solvemate specializing in NLP. He developed a passion for ML during his master’s program and wrote his thesis on audio classification. That’s when he started contributing to open-source software projects such as PyTorch and PyTorch Audio. Now David is excited to start working on the voice prototype of our bot.