Automatic Speech Recognition

Automatic Speech Recognition, or ASR, is software that automatically converts recordings of spoken word into text. The technology is similar to OCR, which turns images of text into characters that a computer can read, but instead of converting images, ASR converts audio and video (AV) files.

Automatic speech recognition has been developing for around 70 years. It started with recognizing a handful of single words, such as digits, math operators and calculator commands, carefully pronounced by a single speaker. It progressed through short sentences composed from thousands of given words with a very restricted structure, to today's virtual intelligent assistants, dialogue systems, and fully automatic transcription of, for example, conference talks.

The first ASR systems, from the 1950s, were based on acoustic models of speech. They were able to determine vowels and consonants by their characteristic spectrum of loud and quiet frequencies. These models were later combined with handcrafted rules, finite-state automata and brute-force search into robust but single-purpose, domain-dependent systems. Later, powerful statistical methods such as Hidden Markov Models and n-gram language models were introduced. They made it possible to train ASR systems fully automatically from paired examples of speech and text. Such a system covers variation in domain, language, dialect, and individual speakers, provided that the variation is sufficiently represented in the training data.
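
To make the idea of an n-gram language model concrete, the following is a minimal sketch of a bigram model (n = 2) in Python. It only counts which word follows which in a few training sentences and turns the counts into relative frequencies. The function names and the tiny training set are invented for illustration; real systems use far larger corpora, smoothing for unseen word pairs, and combine the language model with an acoustic model.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows another in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(words, words[1:]):
            counts[prev][curr] += 1
    return counts


def bigram_probability(counts, prev, curr):
    """Relative-frequency estimate of P(curr | prev); 0.0 if prev was never seen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return followers[curr] / total


# Invented toy data in the spirit of early calculator-command systems.
training_text = [
    "two plus two",
    "two plus three",
    "three minus two",
]
model = train_bigram_model(training_text)

print(bigram_probability(model, "two", "plus"))
print(bigram_probability(model, "two", "minus"))
```

A recognizer can use such probabilities to prefer a word sequence that is likely in the training text over one that merely sounds similar. The sketch also shows why coverage matters: a word pair that never occurs in the training data gets probability zero here.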

Nowadays, the statistical models are being replaced by Deep Neural Networks (DNNs), which do the same job remarkably better, so that the overall quality is sometimes on par with humans. DNNs for end-to-end ASR are capable of finding their own way to transform speech into text: they learn to cope with acoustics, phonetics, grammar, vocabulary, real-world knowledge and orthography without direct human supervision. They only need training data, an appropriate design and powerful computers.

Machine Learning
Speech recognition software is “trained”, which means it has been provided with many AV files with accurate human-made transcriptions, so that it can learn which sounds correspond to which letter combinations. The name for this process is Machine Learning (ML). Humans have to select the material on which the software is trained. Because of this, ASR can be sensitive to accent and dialect: ASR trained on Australian English might not work well for American English.

ASR and Oral History
In the context of Oral History, ASR is often seen as an ideal replacement for the classic process of transcription: one gives the computer an audio file, and after a couple of minutes a perfect transcript is returned. Unfortunately, the technology is not nearly capable of this at present. It can, however, aid the process of transcribing by doing the bulk of the work, after which a person has to correct mistakes, add punctuation and improve syntax. Another use of ASR is improving searchability in large collections of audio and video files. In cases such as the Clariah Media Suite, this means that a search query returns results not only from the metadata, but also from the spoken content of the AV data.
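
The difference between searching metadata and searching spoken content can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Recording` and `Segment` structures, the `search` function and the sample interviews are all hypothetical, and they do not describe how the Clariah Media Suite or any particular ASR tool stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_seconds: float  # where in the recording the text was spoken
    text: str             # ASR output for this stretch of audio


@dataclass
class Recording:
    title: str                                    # metadata entered by an archivist
    segments: list = field(default_factory=list)  # time-stamped ASR transcript


def search(recordings, query):
    """Return (title, start_seconds, matched_text) hits from metadata and transcripts.

    start_seconds is None when the hit comes from the metadata only.
    """
    query = query.lower()
    hits = []
    for rec in recordings:
        if query in rec.title.lower():
            hits.append((rec.title, None, rec.title))
        for seg in rec.segments:
            if query in seg.text.lower():
                hits.append((rec.title, seg.start_seconds, seg.text))
    return hits


# Invented sample collection.
recordings = [
    Recording(
        "Interview with a dock worker",
        [
            Segment(0.0, "i started at the harbour in 1952"),
            Segment(12.5, "the strike lasted three weeks"),
        ],
    ),
    Recording(
        "Interview about the strike of 1956",
        [Segment(0.0, "we had no money that winter")],
    ),
]

for hit in search(recordings, "strike"):
    print(hit)
```

With metadata alone, only the second interview would be found. Because the transcript is searched as well, the first interview also turns up, together with the time at which the word was spoken.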

Considerations when using ASR
When a researcher is contemplating using ASR, it is important to realize that certain AV sources are more suitable than others. Here are some things to keep in mind. First, the audio quality should be high. This means that voices are clear, not echoey, and recorded close to the mouth with adequate microphones. Secondly, ASR works best on monologues. If an audio file is full of people interjecting and talking over each other, the results might be confusing to read, as not all software is capable of telling different people apart by voice. Ideally, the file should have a separate channel for every speaker. Lastly, ASR is generally not very good at dealing with accents and dialects. When dealing with migrants or rural people whose accents might be easy for you to understand, ASR might still struggle considerably, let alone with accents that are hard for human outsiders to understand.
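
If a recording does have a separate channel for every speaker, the channels can be written to individual files and transcribed one by one. The following is a minimal sketch using only Python's standard `wave` module. It assumes an uncompressed PCM WAV file; the function names and the output file naming are illustrative choices, not part of any ASR tool.

```python
import wave


def split_channels(frames, sample_width, n_channels):
    """Split interleaved PCM frames into one byte string per channel."""
    frame_size = sample_width * n_channels
    channels = [bytearray() for _ in range(n_channels)]
    # Each frame holds one sample per channel, stored back to back.
    for offset in range(0, len(frames) - frame_size + 1, frame_size):
        for ch in range(n_channels):
            start = offset + ch * sample_width
            channels[ch] += frames[start:start + sample_width]
    return [bytes(c) for c in channels]


def split_wav_by_channel(in_path, out_prefix):
    """Write each channel of a PCM WAV file to its own mono WAV file."""
    with wave.open(in_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    out_paths = []
    for index, data in enumerate(split_channels(frames, sample_width, n_channels)):
        out_path = f"{out_prefix}_speaker{index + 1}.wav"
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(sample_width)
            dst.setframerate(frame_rate)
            dst.writeframes(data)
        out_paths.append(out_path)
    return out_paths
```

For a stereo interview, calling `split_wav_by_channel("interview.wav", "interview")` would produce `interview_speaker1.wav` and `interview_speaker2.wav` (the file names here are hypothetical). This only helps when each speaker really was recorded on their own channel; it does not separate voices that were mixed into a single channel.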