INTRO TO SPEECH TECH
Speech technology is a game of 'what is most likely to have been said here', and the winner is the speech engine that predicts the result most accurately. There are two main approaches to building speech recognition software: phonetic-based engines and text-based (fixed vocabulary) engines. The first thing you need to know when comparing speech engines is which type you are dealing with. Phonetic-based speech engines are built with a smaller grammar set and use phonemes as the basis for recognition and search, while fixed vocabulary engines are built with a larger, predefined vocabulary. To learn more about the difference between these two approaches, check out our previous blog post, Phonetic vs. Fixed Vocabulary Speech Technology.
Phonetic-based engines tend to be used only in niche use cases, while text-based, large vocabulary engines tend to be the state of the art. Every use case is different, so keep in mind the features that matter most to you when starting this process.
WHAT SHOULD YOU BE LOOKING FOR?
'WER' (Word Error Rate) and 'Word Accuracy' are the best measurements for comparing the accuracy of two engines. Both are typically expressed as percentages and are derived by comparing a reference transcript with the ASR transcript (the hypothesis) for the same audio. The comparison is based on the Levenshtein distance: the reference is aligned with the hypothesis, and each word is counted as correct or as an Insertion, Deletion or Substitution. WER is the number of insertions, deletions and substitutions divided by the number of words in the reference. Basically, you will use this method to compare the machine transcription from each speech engine to a perfect human transcription of the same file.
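To make the scoring concrete, here is a minimal sketch of word-level Levenshtein alignment in Python. The names count_errors, wer and word_accuracy are our own for illustration, and treating Word Accuracy as 1 minus WER is a common convention we are assuming here, not necessarily what every scoring tool reports.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    errors: int
    correct: int
    substitutions: int
    deletions: int
    insertions: int


def count_errors(reference: str, hypothesis: str) -> Counts:
    """Align hypothesis words to reference words with minimum edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = (errors, correct, subs, dels, ins) for ref[:i] vs hyp[:j]
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, 0, i, 0)]  # every reference word so far was deleted
        for j in range(1, len(hyp) + 1):
            e, c, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, c + 1, s, d, n)      # words match
            else:
                diag = (e + 1, c, s + 1, d, n)  # substitution
            e, c, s, d, n = prev[j]
            delete = (e + 1, c, s, d + 1, n)    # reference word missing
            e, c, s, d, n = cur[j - 1]
            insert = (e + 1, c, s, d, n + 1)    # extra hypothesis word
            cur.append(min(diag, delete, insert, key=lambda t: t[0]))
        prev = cur
    return Counts(*prev[-1])


def wer(counts: Counts) -> float:
    """(S + D + I) / N, where N is the number of reference words."""
    ref_words = counts.correct + counts.substitutions + counts.deletions
    return counts.errors / ref_words if ref_words else 0.0


def word_accuracy(counts: Counts) -> float:
    # Assumed convention: accuracy is the complement of WER.
    return 1.0 - wer(counts)


counts = count_errors("please call comcast today", "please call com cast today")
print(counts, wer(counts), word_accuracy(counts))
```

In this example the engine split one word into two, which costs one substitution plus one insertion against a four-word reference. Note that WER can exceed 100% when the hypothesis contains many insertions.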
Keyword spotting accuracy is also important to measure, since spotting keywords is what many people use transcription for. Here you should use precision and recall, which are standard measures in information retrieval. Recall is the percentage of the words you are looking for that were found (so 80% means 8 out of 10 were found and 2 were missed). Precision is the percentage of the hits returned that were valid (so 90% means 9 out of 10 results in the list were true hits and 1 was a false positive). Measure these in addition to WER and Word Accuracy, because the most important words to transcribe correctly are the terms you need to spot or search for. If a speech engine cannot recognize 'Xfinity' or 'Comcast' and those are important terms for your use case, its overall accuracy is irrelevant.
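Here is a rough sketch of how precision and recall could be computed for a keyword list. It is a simplification under stated assumptions: keywords are single words, matching is case-insensitive, and hits are matched by count per keyword, not by position in the audio. The function name keyword_precision_recall is ours.

```python
from collections import Counter


def keyword_precision_recall(reference, hypothesis, keywords):
    """Return (precision, recall) for single-word keywords, matched by count."""
    wanted = {k.lower() for k in keywords}
    ref_counts = Counter(w for w in reference.lower().split() if w in wanted)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in wanted)
    # A hit is valid only up to the number of times the reference has that word.
    true_hits = sum(min(ref_counts[k], hyp_counts[k]) for k in wanted)
    found = sum(hyp_counts.values())      # everything the engine reported
    expected = sum(ref_counts.values())   # everything that was actually said
    precision = true_hits / found if found else 0.0
    recall = true_hits / expected if expected else 0.0
    return precision, recall


reference = "comcast called about xfinity and comcast billing"
hypothesis = "comcast called about infinity and comcast billing"
precision, recall = keyword_precision_recall(
    reference, hypothesis, ["comcast", "xfinity"]
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example every hit the engine returned was valid, but it missed 'xfinity', so precision is perfect while recall is 2 out of 3.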
TIPS TO GET THE BEST RESULTS:
When comparing transcripts, you can pre-process the text of both the reference transcript and the hypothesis transcript to make them easier to compare. For example, converting everything to lowercase and removing speaker turns and punctuation helps the raw accuracy comparison, especially when the results are very close. The accuracy of the reference transcripts becomes more of a factor as accuracy levels increase. At low accuracy levels, errors in the reference are small enough to get lost in the noise.
There can also be issues with word forms that are small factors when accuracy levels are low but become more of an issue with higher accuracy levels. For example:
- Number formats (10 or ten)
- Acronym formats (ATT or AT&T)
- Word forms/spellings (voicebase.com or voicebase dot com)
The best thing to do is to identify all of these possible terms in your recording (or reference transcript) and do a search and replace on all of the identified terms to make them a uniform format.
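The clean-up described above can be sketched as a small normalization function that you apply to both the reference and the hypothesis before scoring. The replacement table and the speaker label pattern ("Agent:" at the start of a line) are assumptions for illustration, so build your own from the terms in your recordings.

```python
import re

# Hypothetical replacement table: map every variant to one uniform format.
REPLACEMENTS = {
    "at&t": "att",
    "10": "ten",
    "voicebase.com": "voicebase dot com",
}


def normalize(text, replacements):
    """Lowercase, drop speaker labels, unify word forms, strip punctuation."""
    text = text.lower()
    # Assumed speaker turn format: a single word plus a colon at line start.
    text = re.sub(r"(?m)^\s*\w+:\s*", "", text)
    # Replace variants before stripping punctuation, so that forms such as
    # "at&t" and "voicebase.com" can still be matched as written.
    for old, new in replacements.items():
        pattern = rf"(?<!\w){re.escape(old)}(?!\w)"
        text = re.sub(pattern, lambda _match, new=new: new, text)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


raw = "Agent: Thanks for calling AT&T, visit voicebase.com.\nCaller: I have 10 questions!"
print(normalize(raw, REPLACEMENTS))
```

Running the same function over both transcripts is the important part: it keeps formatting differences from being counted as recognition errors.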
Once you've gotten past those hurdles and you know what to look for, you're ready to get started testing with these 6 steps:
Step 1: Identify The Right Recordings
Find a set of recordings that are representative of the audio you will be working with. To get the best comparison, be sure this content includes all of the unique terms, numbers (account numbers, phone numbers, PCI, SSN, addresses, etc.) and acronyms that you will need to spot.
Step 2: DO NOT COMPRESS
For the best results, do not over-compress the audio or, better yet, do not compress it at all. Compression just dilutes the accuracy of each engine and gives you poor results. The higher the data rate and the higher the sampling frequency, the better the results. For example, recordings sampled below 16 kHz tend to yield much poorer results. Generally 16 kHz is great, and we recommend 44 kHz or better for the highest quality results. Often, however, you are limited to whatever your system outputs. If you record phone calls, you are typically limited to 8 kHz (unless your calls use one of the more advanced 16 kHz codecs), so if you have 8 kHz calls, recording them at 16 kHz will not improve things. However, if your system supports the newer 16 kHz codecs, you could see a solid improvement.
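Before testing, it is worth checking what your recordings actually contain. Below is a small sketch using Python's standard wave module, which reads uncompressed PCM WAV files only. The function name describe_wav and the 16 kHz threshold flag are our own choices, based on the guidance above.

```python
import wave


def describe_wav(path):
    """Report the basic audio parameters of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "seconds": wav.getnframes() / rate,
            # Flag telephone-style audio sampled below 16 kHz.
            "below_16khz": rate < 16000,
        }
```

Call it as describe_wav("call.wav"), where "call.wav" stands in for one of your own test files. Keep in mind that the header only tells you the container's sample rate. An 8 kHz call that was later resampled to 16 kHz will report 16 kHz without carrying any extra information.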
If possible, use lossless codecs or just plain PCM. (This applies to production as well as testing, so if your current process is to compress everything, start thinking of a Plan B.) Using PCM, however, means files that are 6 to 10x larger. When using an mp3 file instead of a PCM WAV file, we typically see word error rate increase by 1 to 2 percentage points (for comparison, a change of 10% is typically noticeable). So if you would like to save 6 to 10x in file size and are fine with a 1 to 2 point hit in accuracy, that’s an option.
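To see where a size difference of this order comes from, here is a quick back-of-the-envelope calculation. The mp3 bit rate of 32 kbps is an assumed example value, not a recommendation, and the helper names are ours.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample=16, channels=1):
    """Uncompressed PCM bit rate: samples/second x bits/sample x channels."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def size_ratio(pcm_kbps, compressed_kbps):
    """How many times larger the PCM file is than the compressed file."""
    return pcm_kbps / compressed_kbps


pcm = pcm_bitrate_kbps(16000)   # 16 kHz, 16-bit, mono
print(pcm, size_ratio(pcm, 32))  # compared with an assumed 32 kbps mp3
```

With these assumed numbers, 16 kHz 16-bit mono PCM runs at 256 kbps, which is 8 times the size of a 32 kbps mp3 and falls inside the 6 to 10x range mentioned above.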
Step 3: Human Transcripts To Compare
You'll need to obtain plain text reference transcripts for each test file. There are many vendors you can pay for this. It is important to note that this is different from human tagging or scoring: you will need a full, readable transcription, not just check marks for what was said.
Step 4: Individual Machine Transcripts
For all ASR (Automatic Speech Recognition) engines under test, you'll need to obtain plain text transcripts for each test file. Basically you need to run each file through every speech engine you're testing and download a plain .TXT file of the results.
Step 5: Run Test Comparisons
This can be done using SCLITE, a public-domain tool from NIST. SCLITE is part of the Speech Recognition Scoring Toolkit (SCTK). If you do not have access to that software, VoiceBase sales engineers can run your speech results from different vendors through our assessment systems and provide you with the results.
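If you do run SCLITE yourself, the transcripts first need to be in a format it accepts. The sketch below writes the 'trn' format (one utterance per line, followed by an id in parentheses) and builds a command line. The helper names and the "spk1_001" id style are our own, and the flags reflect a commonly used invocation that we are assuming here, so check them against the documentation for your SCTK version.

```python
def to_trn_line(text, utterance_id):
    """One transcript in trn style: the words, then the id in parentheses."""
    return f"{' '.join(text.split())} ({utterance_id})"


def write_trn(transcripts, path):
    """Write {utterance_id: text} to a trn file, one utterance per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for utterance_id, text in sorted(transcripts.items()):
            handle.write(to_trn_line(text, utterance_id) + "\n")


def sclite_command(ref_path, hyp_path):
    # Assumed invocation: -r reference, -h hypothesis, both in trn format,
    # -i selects the utterance id convention, -o selects the report type.
    return [
        "sclite",
        "-r", ref_path, "trn",
        "-h", hyp_path, "trn",
        "-i", "rm",
        "-o", "sum", "stdout",
    ]


print(to_trn_line("thanks for calling att", "spk1_001"))
print(" ".join(sclite_command("ref.trn", "engine_a.trn")))
```

The reference file and each engine's hypothesis file must use the same utterance ids so that the tool can pair them up.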
Step 6: Review Results
Compare the pros and cons of each engine across the key data points (WER, Word Accuracy, speed of results and cost) to determine which speech recognition engine fits the needs of your content. Here are some other features you may want to compare as well:
- Number formatting (phone numbers, addresses, zip codes, SSN, etc)
- Redaction (The ability to remove sensitive data such as PCI, PII, SSN)
- Custom Vocabulary (The ability to add acronyms, proper nouns and names to a unique dictionary on the fly)
- Auto Call Classification/Disposition (The ability to spot events in a recording such as a hot lead, upset customer, appointment made or an agent that needs training).
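When you line up the WER numbers from Step 6 across a whole test set, one detail is easy to get wrong: pooling the errors over all files is not the same as averaging the per-file percentages, because short files then count as much as long ones. A minimal sketch follows, with made-up engine names and counts purely for illustration.

```python
def pooled_wer(per_file_counts):
    """Total errors over total reference words, from (errors, ref_words) pairs."""
    errors = sum(e for e, _ in per_file_counts)
    words = sum(n for _, n in per_file_counts)
    return errors / words if words else 0.0


def rank_engines(results):
    """Sort engines from lowest to highest pooled WER."""
    scored = [(name, pooled_wer(counts)) for name, counts in results.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical results: (errors, reference words) for a short and a long file.
results = {
    "engine_a": [(2, 10), (30, 100)],
    "engine_b": [(5, 10), (10, 100)],
}
for name, score in rank_engines(results):
    print(f"{name}: {score:.1%}")
```

With these made-up numbers, engine_b wins on pooled WER even though engine_a has the lower average of per-file WERs, so decide up front which view matches how your audio is distributed.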
WHAT ARE YOU REALLY LOOKING FOR?
Many businesses look to transcription and speech to text in order to unearth something else in their recordings: angry customers, appointments made, rude agents, etc. Transcription is a means to an end, a way to find the word, phrase or event you're really interested in. If this is the case, instead of measuring accuracy, measure how well the speech technology can spot the important events in your spoken content, such as 'customers about to cancel' or 'hot leads'. It doesn't matter how good a transcript is when what you really care about are events that are difficult to find in any transcript.