Converting spoken information into text, data, or actionable insights is not an easy task. Speech technology is a game of 'what is most likely to have been said here', and the winner is the speech engine that returns the most accurate results. Other features and value can be layered on top of the pure transcript, such as keyword spotting, redaction, or predictive analytics, but here we will focus on the accuracy of the transcript, as that is the basis for everything else.
There are two main approaches to building speech recognition software: phonetic-based engines and text-based (fixed vocabulary) engines. Phonetic-based speech engines are built with a smaller grammar set and use phonemes as the basis for recognition and search, while fixed vocabulary engines are built with a larger, fixed, predefined vocabulary.
PHONETIC BASED SPEECH ENGINES
Typically, phonetic speech engines take an audio stream and compress it, much as music is compressed into MP3 format, but at a much higher compression ratio. Although there are some advantages to this method, it is worth mentioning that this high compression of the audio stream is not lossless and greatly diminishes the quality of the audio. One advantage of these systems is that pattern matching can be performed on the compressed files to find words that sound similar. As a result, one can search for new, out-of-vocabulary words, and the engine will return similar-sounding results. Another advantage is that these engines are very fast at creating the searchable index (the compressed audio file), so a lot of content can be made searchable very quickly. This is why these systems have been used primarily for firehose mining: searching huge amounts of audio, as it is created, for certain trigger words. It is a very popular use case in the intelligence community.
However, phonetic engines also have a number of significant drawbacks which make them less suitable for non-governmental use cases. First, searches tend to return an excess of false positives. Searching for “interested” will give you results where that word was spoken, but will also point you to lots of other, similar-sounding words, which ends up creating more work for the person using the system. Second, while phonetic engines excel at making a file searchable very quickly, they require huge computing power and resources to search large volumes of files, for example an archive of the previous day or week. The reason is that they have to load each compressed audio file and pattern match against the pronunciation, so any file to be searched has to be loaded into memory. Understandably, these engines run on the more expensive side.
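To make the false-positive problem concrete, here is a minimal sketch of phonetic matching. It is not how any particular engine works: the pronunciations are hand-written for illustration, the function names are hypothetical, and matching is done word by word rather than over a continuous phoneme stream. It only shows why a tolerance that lets you find similar-sounding words also returns words you did not ask for.

```python
# Toy pronunciations, hand-written for illustration (not taken from a real lexicon).
PRONUNCIATIONS = {
    "interested": ["IH", "N", "T", "R", "AH", "S", "T", "AH", "D"],
    "interesting": ["IH", "N", "T", "R", "AH", "S", "T", "IH", "NG"],
    "invested": ["IH", "N", "V", "EH", "S", "T", "AH", "D"],
    "weather": ["W", "EH", "DH", "ER"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phonetic_search(query, spoken_words, max_distance=2):
    """Return (position, word, distance) for every word that sounds close to the query."""
    query_phonemes = PRONUNCIATIONS[query]
    hits = []
    for position, word in enumerate(spoken_words):
        distance = edit_distance(query_phonemes, PRONUNCIATIONS[word])
        if distance <= max_distance:
            hits.append((position, word, distance))
    return hits


spoken = ["interesting", "weather", "interested", "invested"]
hits = phonetic_search("interested", spoken)
print(hits)
```

With the default tolerance, the search finds the real occurrence of "interested" but also flags "interesting". Loosening `max_distance` to 3 pulls in "invested" as well, which is the trade-off described above: more recall on unseen words, more results for a person to sift through.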
FIXED/LARGE VOCAB SPEECH ENGINES
With the second speech technology we mentioned, a fixed/large vocabulary engine like VoiceBase, searches take only a split second, and computing power is a non-issue thanks to the API structure that cloud-based companies have adopted. A large text vocabulary engine typically comes with the drawback that out-of-vocabulary words cannot be found. However, VoiceBase has developed a custom-vocabulary feature that allows custom names, industry terms, or acronyms to be added on the fly for any file or batch of files, which eliminates the biggest drawback of text/large vocabulary engines. Another benefit of a large text vocabulary speech engine is that it can be built using deep learning (neural network based logic), which makes it very flexible, continuously improving, and easy to run in the cloud, eliminating additional on-premise costs.
When it comes to text/large vocabulary engines, two technologies have been used to achieve high accuracy. HMM (Hidden Markov Model) based engines have been used successfully for a long time. However, these have recently been surpassed by so-called deep learning (neural network based) speech engines, which tend to significantly outperform HMM-based engines, though at the expense of a lot more computing time. Neural network techniques tend to be more robust and resilient to accents and background noise, and most of the big players in this space have recently shifted to this approach.
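The sketch below illustrates the two points made above: why searching a text transcript is fast, and why an out-of-vocabulary word cannot be found unless it is added as a custom term. It is a simplified model, not VoiceBase's API. The `transcribe` function, the `custom_vocabulary` parameter, and the `<unk>` placeholder are all hypothetical, and the "audio" is simulated as the list of words actually spoken.

```python
from collections import defaultdict

# A tiny stand-in for an engine's fixed, predefined vocabulary.
BASE_VOCABULARY = {"please", "call", "about", "the", "renewal", "plan"}


def transcribe(spoken_words, custom_vocabulary=()):
    """Simulated recognizer: it can only output words it has in its vocabulary."""
    vocabulary = BASE_VOCABULARY | set(custom_vocabulary)
    return [word if word in vocabulary else "<unk>" for word in spoken_words]


def build_index(transcript):
    """Inverted index: word -> list of positions, so a search is a single lookup."""
    index = defaultdict(list)
    for position, word in enumerate(transcript):
        index[word].append(position)
    return index


spoken = ["please", "call", "acme", "about", "the", "renewal"]

plain_index = build_index(transcribe(spoken))
custom_index = build_index(transcribe(spoken, custom_vocabulary=["acme"]))

print(plain_index.get("acme", []))    # the name was never transcribed, so nothing to find
print(custom_index.get("acme", []))   # found once the term is part of the vocabulary
```

A real engine would more likely substitute a similar-sounding in-vocabulary word than emit a placeholder, but the effect on search is the same: the term is missing from the transcript until the vocabulary includes it.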
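For readers curious what 'what is most likely to have been said here' looks like in an HMM, below is a small sketch of Viterbi decoding, the standard way to find the most probable hidden state sequence in a Hidden Markov Model. The two states, the observation labels, and all probabilities are invented for illustration; a production engine works with far larger models and real acoustic features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence and its probability."""
    # Probability of the best path ending in each state after the first observation.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor that makes arriving in state s most likely.
            prev_state = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev_state] * trans_p[prev_state][s] * emit_p[s][obs]
            pointers[s] = prev_state
        best.append(scores)
        back.append(pointers)
    # Walk the back-pointers from the best final state to recover the path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model: two phoneme-like states and two coarse acoustic labels (all values assumed).
states = ("s", "f")
start_p = {"s": 0.6, "f": 0.4}
trans_p = {"s": {"s": 0.7, "f": 0.3}, "f": {"s": 0.4, "f": 0.6}}
emit_p = {"s": {"hiss": 0.8, "puff": 0.2}, "f": {"hiss": 0.3, "puff": 0.7}}

path, probability = viterbi(["hiss", "hiss", "puff"], states, start_p, trans_p, emit_p)
print(path, probability)
```

The decoder never sees the states directly. It only sees the observations and picks the sequence of states that best explains them, which is the same guessing game a speech engine plays at a much larger scale.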
WHICH METHOD FITS YOUR NEEDS?
A benefit often noted for phonetic-based engines is that indexing audio (converting it to phonemes) is typically much faster than approaches which use language models. However, searching is not as easy or accurate. Searching text, on the other hand, is faster, easier, and more accurate than searching phoneme streams, but converting speech to text is the slower step, and such engines are very hard to develop. Generally, phonetic-based engines tend to be used only in very niche use cases, while text/large vocabulary engines tend to be the state of the art. Every use case is different, so keep in mind the features that will matter most to your business when evaluating speech technologies.