A transcription chain is a set of consecutive "steps" one has to perform in order to go from a recorded interview to a findable, accessible and viewable digital AV-document with relevant metadata on the Internet.
Below, we explain the picture above: the various steps one may take in a transcription chain. It should be stressed that there is no "perfect" solution: the Transcription Chain is a set of consecutive steps one may take to reach the final goal, a findable, accessible and viewable digital AV-document.
If the audio quality is good and a 100% verbatim transcription of the speech is not necessary, automatic speech recognition without human checking and correction may be enough. On the other hand, if the audio quality is poor, people speak in dialect and a verbatim transcription is needed for a thorough dialogue analysis, human transcription combined with forced alignment by the computer may be necessary.
From an interview to a digital repository
Basically, one may divide interviews into existing (already recorded) and new (still to be made), and into analogue and digital. Analogue interviews cannot be accessed in any other way than by going to the place where the physical object is stored and playing it on a device similar to the one it was recorded on. So, to make them accessible via the Internet, analogue recordings need to be digitized. Moreover, it is not very likely that many new interviews will be recorded on analogue devices.
So in the process "From interview to a digital recording suitable for recognition / alignment" we have three possibilities.
Digitizing Analogue recordings
A simple version of the digitisation process is:
Play the analogue recording and send the analogue output via an AD-converter to a computer where it will be stored as a digital file.
This is true, but to get the best quality one needs to consider quite a few issues.
Adjusting the playing device
The quality of the analogue playing device must be as good as possible. Using “just a tape recorder or cassette player” may introduce all kinds of (disturbing) artefacts in the digital version of the recording. A nice overview of good practice is given by Kees Grijpink of the Meertens Institute in Amsterdam.
During a visit to the Meertens Institute, Kees showed how one needs to open a compact cassette, clean it and remove all dust, clean the tape guides and get them moving freely again, remove the tape heads, clean them with alcohol and put them back, make the pressure pads work again (they dry out and become less flexible) and, very importantly, adjust the azimuth of the tape head (for a nice video instruction see here).
Modern software can (partly) correct for azimuth deviations, but it is preferable to do this while digitizing the recording.
More or less the same needs to be done with tapes and tape recorders.
The analogue signal will be digitized via a so-called AD-converter. Basically, the analogue signal is sampled every x milliseconds, and the value of the signal at that moment is stored as a number in a digital file. The smaller the interval between two sample points, the better the resulting quality, but the more disk space a recording will need.
The number of samples per second is called the sampling frequency and is expressed in Hertz (Hz). A sampling frequency of 16,000 Hz (= 16 kHz) means that the sound wave is measured 16,000 times per second.
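The sampling process can be sketched in a few lines of Python. This is a purely illustrative simulation of what an AD-converter does (measure a wave at fixed intervals and store each measurement as a 16-bit number); it is not part of any real digitization tool:

```python
import math

def sample_signal(freq_hz, sample_rate_hz, duration_s):
    """Simulate an AD-converter: measure a sound wave at fixed time
    intervals and store each measurement as a 16-bit integer."""
    n_samples = int(sample_rate_hz * duration_s)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz              # moment of this sample point
        value = math.sin(2 * math.pi * freq_hz * t)
        samples.append(int(value * 32767))  # quantize to 16 bits
    return samples

# A 440 Hz tone sampled at 16 kHz for 10 ms gives 160 sample points.
tone = sample_signal(440, 16_000, 0.010)
print(len(tone))  # 160
```

Doubling the sampling frequency doubles the number of stored values, which is exactly the quality/disk-space trade-off described above.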
Human Hearing and Voice
But what is an ideal sampling frequency for spoken interviews, for the human voice?
- The range of human hearing is about 20 Hz to 20 kHz (depending on the age of the listener) and is most sensitive between 2 and 4 kHz: the domain of human speech.
- The normal human voice ranges from about 80 Hz to 8 kHz.
An important criterion is the so-called Nyquist sampling criterion. Without going into too much detail (see the link for more information), one may say that if the maximum frequency of a sound wave is N (i.e. the sound wave oscillates N times per second), the number of samples per second one needs to record is 2×N. The maximum frequency of the human voice (in speech) is about 8,000 Hz, so the required sampling frequency is around 16,000 Hz.
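The Nyquist criterion can be illustrated with a small (purely illustrative) Python sketch: a tone above half the sampling rate produces exactly the same sample values as a lower "alias" tone, so the two are indistinguishable in the recording:

```python
import math

SAMPLE_RATE = 16_000  # 16 kHz

def sample_cosine(freq_hz, n_samples, rate=SAMPLE_RATE):
    """Sample a cosine wave of the given frequency at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / rate)
            for n in range(n_samples)]

# A 10 kHz tone violates the criterion (2 x 10 kHz > 16 kHz): its sample
# values are identical to those of a 6 kHz tone (16 kHz - 10 kHz).
high  = sample_cosine(10_000, 32)
alias = sample_cosine(6_000, 32)
print(max(abs(a - b) for a, b in zip(high, alias)))  # ~0.0 (identical
# up to floating-point error)
```

So frequencies above half the sampling rate are not just lost; they fold back as false lower frequencies, which is why the sampling rate must be at least twice the highest frequency one wants to capture.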
Higher and Higher?
For years, 16 kHz with 2-byte (= 16-bit) samples was considered the optimal sampling setting for recording the human voice. Higher sampling frequencies result in more data, but not in a higher-quality recording of the human voice.
Nearly all ASR-engines require a sampling frequency of 16 kHz. However, because memory is cheap and plentifully available, and a lot of research is being done on the "improvement" of sound quality, it may be wise to follow the advice of the European Broadcasting Union and use a much higher sampling rate: 96 kHz with 4-byte (= 32-bit) values.
This will increase the amount of disk space needed by a factor of 12. So do it if you have enough disk space available, and store these files in a kind of dark archive. For "daily" use (sending the recordings to colleagues, listening to the speech, processing the recording with an ASR-engine), use the 16 kHz / 16-bit version.
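The factor of 12 follows directly from the numbers above; a small illustrative Python calculation makes it explicit:

```python
def bytes_per_hour(sample_rate_hz, bytes_per_sample, channels=1):
    """Raw (uncompressed) storage needed for one hour of audio."""
    return sample_rate_hz * bytes_per_sample * channels * 3600

daily   = bytes_per_hour(16_000, 2)  # 16 kHz, 16-bit mono
archive = bytes_per_hour(96_000, 4)  # 96 kHz, 32-bit mono (EBU advice)

print(daily / 1e6, "MB per hour")    # 115.2 MB per hour
print(archive / 1e6, "MB per hour")  # 1382.4 MB per hour
print(archive / daily)               # 12.0
```

A factor 6 comes from the sampling rate (96/16) and a factor 2 from the sample size (32/16 bits), giving 12 in total.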
Mono or Stereo?
If the original recording was in mono, use mono. Note that the default setting of recording software is often 44.1 kHz stereo.
One may argue that a high sampling frequency may be useful in the near future, but a stereo file made from a mono recording (twice the same signal, on the left and the right channel) does not make sense.
When the original recording was made with two microphones (each on a separate channel) and there is a clear difference between the left and right channels, it is wise to convert it to a digital stereo file.
For new recordings this is different: stereo, or even better, each speaker recorded with his/her own microphone on a separate channel, is ideal.
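Down-mixing a stereo file to mono can be done with any audio editor; as an illustration, here is a minimal Python sketch (standard library only, with made-up sample values) that averages the two channels of interleaved 16-bit stereo PCM:

```python
import struct

def stereo_to_mono(raw_pcm):
    """Down-mix interleaved 16-bit stereo PCM (L, R, L, R, ...) to mono
    by averaging the two channels of every frame."""
    n = len(raw_pcm) // 4                     # 4 bytes per stereo frame
    frames = struct.unpack("<%dh" % (2 * n), raw_pcm[: 4 * n])
    mono = [(frames[i] + frames[i + 1]) // 2 for i in range(0, 2 * n, 2)]
    return struct.pack("<%dh" % n, *mono)

# Two frames: L=1000/R=1000 (a duplicated mono signal) and L=200/R=400.
stereo = struct.pack("<4h", 1000, 1000, 200, 400)
print(struct.unpack("<2h", stereo_to_mono(stereo)))  # (1000, 300)
```

Where the two channels carry the same signal (as in a stereo file made from a mono recording), the down-mix loses nothing, which is exactly the point made above.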
Soundcards & Software
When converting the analogue signal into its digital equivalent, the easiest and cheapest solution is to plug the line-out of your tape/cassette recorder into the line-in of your computer's sound card.
This may give "acceptable" results, but the quality of the digital recording heavily depends on:
- the quality of the sound card in your computer
- the software program used for the recording
If you plan to digitize a lot of analogue recordings, it may be wise to invest in a good sound card. The best and most practical solution is an external USB-soundcard. The advantages of such a device are that:
- it is portable and you can use it with different computers
- you’ll typically get higher-quality audio processing than with your PC’s built-in sound card. The best of these units offer 24-bit/96 kHz digital audio
There are many free and paid software packages that can be used for the recording. A nice overview of some free tools can be found here.
If you really plan to do massive digitizing of your analogue collection, it may be wise to buy a (semi-)professional package with which you can do things such as azimuth correction, noise cancelling, filtering and more.