Automatic Speech Recognition

Automatic Speech Recognition, or ASR, is software that automatically converts recordings of speech into text. The technology is comparable to optical character recognition (OCR), which turns images of text into characters that a computer can read; instead of images, ASR converts audio and video (AV) files.

Commercial use
ASR has many uses. Commercially, technology companies have built ASR into products such as Apple’s Siri, Google’s Assistant, and Amazon’s Alexa. In these products, speech recognition happens almost instantaneously, which lets it function as an alternative to typing or navigating a menu. This is not the norm, however: much ASR software requires the user to provide an AV file (.mp3, .wav, etc.), after which the software takes some time to produce results.

Machine learning
Some speech recognition software is “trained”, which means it has been given many AV files together with accurate human-made transcriptions, so that it can learn which sounds correspond to which letter combinations. This process is called machine learning. It means that people have to select the material the software is trained on. Because of this, ASR can be sensitive to accent and dialect: software trained on Australian English might not work well for American English.

ASR and Oral History
In the context of Oral History, ASR is often seen as an ideal replacement for the classic process of transcription: one gives the computer an audio file, and a couple of minutes later a perfect transcript is returned. Unfortunately, the technology is not nearly capable of this at present. It can, however, aid transcription by doing the bulk of the work, after which a person has to correct mistakes, add punctuation, and improve syntax. Another use of ASR is improving searchability in large collections of audio and video files. In cases such as the Clariah Media Suite, this means that a search query is not limited to the metadata but also returns results from the spoken content of the AV data.
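
To illustrate searching in spoken content, here is a minimal Python sketch. It assumes the ASR software returns its transcript as a list of text segments, each with a start time in seconds. The Segment class, the search_transcript function, and the sample interview are hypothetical and invented for illustration; they do not show how the Clariah Media Suite or any particular ASR product works.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of recognized speech and where it starts in the recording."""
    start_seconds: float
    text: str


def search_transcript(segments, query):
    """Return the segments whose recognized text contains the query.

    The comparison ignores upper and lower case.
    """
    needle = query.casefold()
    return [s for s in segments if needle in s.text.casefold()]


# Hypothetical ASR output for one interview (invented sample data).
interview = [
    Segment(0.0, "I was born in a small village near the coast"),
    Segment(12.5, "my father worked in the harbour all his life"),
    Segment(31.0, "we left the village when I was nine"),
]

# Print every place in the recording where the word "village" is spoken.
for hit in search_transcript(interview, "village"):
    print(f"{hit.start_seconds:>6.1f}s  {hit.text}")
```

Because every hit carries a start time, a researcher can jump straight to the relevant moment in the recording instead of relying on the metadata alone. A misrecognized word will not be found this way, so search results are only as good as the transcript itself.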

Considerations when using ASR
When a researcher is considering ASR, it is important to realize that some AV sources are more suitable than others. Here are some things to keep in mind. First, the audio quality should be high: voices should be clear, free of echo, and recorded close to the mouth with adequate microphones. Second, ASR works best on monologues. If an audio file is full of people interjecting and talking over each other, the results might be confusing to read, as not all software can tell different speakers apart by voice. Ideally, the file should have a separate channel for every speaker. Lastly, ASR is generally not very good at dealing with accents and dialects. When the speakers are migrants or people from rural areas, ASR might still struggle considerably with accents that are easy for you to understand, let alone accents that are hard for outsiders to follow.
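
If a recording already has one speaker per channel, the channels can be saved as separate files before they are given to ASR software. The following is a minimal sketch using only Python's standard wave module. It assumes the source is an uncompressed PCM .wav file; the function name split_channels and the file names in the usage note are hypothetical.

```python
import wave


def split_channels(source_path, target_pattern):
    """Write each channel of a WAV file to its own mono WAV file.

    target_pattern must contain "{n}", which is replaced by the channel
    number (starting at 1). Returns the list of paths that were written.
    """
    with wave.open(source_path, "rb") as source:
        channels = source.getnchannels()
        width = source.getsampwidth()  # bytes per sample
        rate = source.getframerate()
        frames = source.readframes(source.getnframes())

    frame_size = channels * width  # bytes per frame (one sample per channel)
    paths = []
    for n in range(channels):
        # Samples are interleaved: frame 0 channel 0, frame 0 channel 1, ...
        # Collect only the bytes that belong to channel n.
        mono = b"".join(
            frames[i + n * width : i + (n + 1) * width]
            for i in range(0, len(frames), frame_size)
        )
        path = target_pattern.format(n=n + 1)
        with wave.open(path, "wb") as target:
            target.setnchannels(1)
            target.setsampwidth(width)
            target.setframerate(rate)
            target.writeframes(mono)
        paths.append(path)
    return paths
```

Calling split_channels("interview.wav", "speaker_{n}.wav") on a two-channel file would write speaker_1.wav and speaker_2.wav, which can then be transcribed one at a time. This only helps when the speakers were recorded on separate channels in the first place; it cannot untangle voices that were mixed onto a single channel.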