A transcription chain is a set of concatenated "steps" one has to perform in order to go from a "recorded interview" to a findable, accessible and viewable digital AV-document with relevant metadata on the Internet.
Below, we explain the picture above: the various steps one may take in a transcription chain. It has to be stressed that there is no "perfect" solution; the transcription chain is a set of consecutive steps one may take to reach the final goal of a findable, accessible and viewable digital AV-document.
If the audio quality is good and a 100% verbatim transcription of the spoken speech is not necessary, automatic speech recognition without a human check and correction may be enough. On the other hand, if the audio quality is poor, people speak in dialect and a verbatim transcription is necessary for a thorough dialogue analysis, human transcription with forced alignment by the computer may be necessary.
From an interview to a digital repository
Basically, one may divide interviews into existing (already recorded) and new (still to be made), and into analogue and digital. Analogue interviews cannot be accessed in any other way than by going to the place where the physical object is stored and playing it on a device similar to the one it was recorded with. So, to make them accessible via the Internet, analogue recordings need to be digitized. Moreover, it is not very likely that many new interviews will be recorded on an analogue device.
So in the process "From interview to a digital recording suitable for recognition / alignment" we have three possibilities.
Digitizing Analogue recordings
A simple version of the digitisation process is:
Play the analogue recording and send the analogue output via an AD-converter to a computer where it will be stored as a digital file.
This is true, but to get the best quality one needs to consider a number of issues.
Adjusting the playing device
The quality of the analogue playing device must be as good as possible. Using “just a tape recorder or cassette player” may introduce all kinds of (disturbing) artefacts in the digital version of the recording. A nice overview of good practice is given by Kees Grijpink of the Meertens Institute in Amsterdam.
During a visit to the Meertens Institute, Kees showed how one needs to open a compact cassette, clean it, remove all dust, clean the tape guides and make them run smoothly again, remove and clean the tape heads with alcohol before putting them back, make the pressure pads flexible again (they dry out and become less flexible) and, very important, adjust the azimuth of the tape head (for a nice video instruction see here).
Modern software can (partly) correct for azimuth deviations, but it is preferred to do this while digitizing the recording.
More or less the same needs to be done with tapes and tape recorders.
AD-conversion
The analogue signal will be digitized via a so-called AD-converter. Basically, the analogue signal is sampled every x milliseconds and the value of the analogue sample at that moment is stored as a number in a digital file. (Sampling means measuring the value, the amplitude, of a sound wave at a certain time; if a 2-byte number is used per sample, its value will lie between -32768 and 32767.) The smaller the interval between two sample points, the better the resulting quality, but the more disk space a recording will need.
The number of samples per second is called the sample frequency and is expressed in Hertz (Hz). A sampling frequency of 16,000 Hz (= 16 kHz) means that the sound wave is measured 16,000 times per second.
Human Hearing and Voice
But what is an ideal sampling frequency for spoken interviews, for the human voice?
- The range of human hearing is about 20 Hz to 20 kHz (depending on the age of the listener) and is most sensitive between 2 and 4 kHz: the domain of human speech.
- The normal human voice range lies between 80 Hz and 8 kHz.
Nyquist
An important criterion is the so-called Nyquist sampling criterion. Without going into too much detail (see the link for more information), one may say that if the maximum frequency of a sound wave is N (i.e. the sound wave oscillates N times per second), the number of samples per second one needs to record is 2×N. The maximum range of the human voice (in speech) is ±8,000 Hz, so the required sampling frequency will be around 16,000 Hz.
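As a quick check of this arithmetic, a minimal Python sketch:

```python
# Nyquist criterion: the sampling frequency must be at least twice
# the highest frequency present in the signal.
max_speech_frequency_hz = 8_000                      # upper bound of the human voice in speech
min_sampling_frequency_hz = 2 * max_speech_frequency_hz

print(min_sampling_frequency_hz)                     # 16000 -> 16 kHz
```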
Higher and Higher?
For years, 16 kHz with a 2-byte (= 16-bit) sample was considered the optimal sampling frequency for recording the human voice. Higher sample frequencies will result in more data but not in a higher quality recording of the human voice.
Nearly all ASR-engines require a sampling frequency of 16 kHz. However, because memory is cheap and plentifully available and a lot of research is being done on the "improvement" of sound quality, it may be wise to follow the advice of the European Broadcasting Union and use a much higher sampling rate: 96 kHz with a 4-byte (= 32-bit) value.
This will increase the amount of disk space by a factor of 12. So do it if you have enough disk space available and store these files in a kind of dark archive. For "daily" use (sending the recordings to colleagues, listening to the speech and processing the recording with an ASR-engine) use the 16 kHz / 16-bit version.
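Where this factor 12 comes from can be verified with a minimal Python sketch (plain Python, no external libraries; a one-hour mono interview is assumed):

```python
def wav_size_bytes(duration_s, sample_rate_hz, bytes_per_sample, channels=1):
    """Approximate size of an uncompressed recording (file header ignored)."""
    return duration_s * sample_rate_hz * bytes_per_sample * channels

one_hour = 60 * 60
daily_use = wav_size_bytes(one_hour, 16_000, 2)   # 16 kHz, 16-bit, mono
archival  = wav_size_bytes(one_hour, 96_000, 4)   # 96 kHz, 32-bit, mono

print(daily_use / 1e6)        # ~115 MB
print(archival / 1e6)         # ~1382 MB
print(archival / daily_use)   # 12.0 -> the factor 12 mentioned above
```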
Mono or Stereo?
If the original recording was in mono, use mono. The default settings of recording software are often set to 44.1 kHz stereo.
One may argue that the high sampling frequency may be useful in the near future, but a stereo file made from a mono recording (twice the same signal, on the left and the right channel) does not make sense.
When the original recording was made with two microphones (each on a separate channel) and there is a clear difference between the left and right channel, it is wise to convert it to a digital stereo file.
For new recordings this is different: stereo, or even better, each speaker recorded with his/her own microphone on a separate channel, is ideal.
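As an illustration, the sketch below checks whether a "stereo" file actually contains two identical channels and, if so, keeps only one of them. It uses the third-party pydub library as an example tool; the file names are illustrative.

```python
from pydub import AudioSegment   # third-party library; needs ffmpeg for non-WAV formats

recording = AudioSegment.from_wav("interview.wav")

if recording.channels == 2:
    left, right = recording.split_to_mono()
    if left.raw_data == right.raw_data:
        # "Fake" stereo: both channels carry the same signal, so a mono file suffices.
        left.export("interview_mono.wav", format="wav")
```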
Soundcards & Software
When converting the analogue signal into its digital equivalent, the easiest and cheapest solution is to plug the line-out of your tape/cassette recorder into the line-in of your computer's sound card.
This may give "acceptable" results, but the quality of the digital recording heavily depends on:
- the quality of the sound card in your computer
- the software program used for the recording
Sound cards
If you plan to digitize a lot of analogue recordings, it may be wise to invest a bit in a good sound card. The best, and most practical solution is an external USB-soundcard. The advantage of such a device is that:
- it is portable and you can use it with different computers
- you’ll typically get higher-quality audio processing than you get with your PC’s built-in sound card. The best of these units offer 24-bit/96kHz digital audio
Software
There are many free and paid software packages that can be used for the recording. A nice overview of some free tools can be found here.
If you really plan to do massive digitizing of your analogue collection, it may be wise to buy a (semi-)professional package in which you can do things such as azimuth correction, noise cancelling, filtering and more.
Converting existing digital recordings
In recent years, many analogue interview collections have been digitized. It may be the case that the end result is sub-optimal and it would be better to start over, but here we only deal with the transformation from one existing digital format into another.
There are three major groups of audio file formats:
- Uncompressed audio formats, such as WAV, AIFF, AU or raw header-less PCM;
- Formats with lossless compression, such as FLAC, Monkey's Audio (filename extension .ape), WavPack (filename extension .wv), TTA, ATRAC Advanced Lossless, ALAC (filename extension .m4a), MPEG-4 SLS, MPEG-4 ALS, MPEG-4 DST, Windows Media Audio Lossless (WMA Lossless), and Shorten (SHN);
- Formats with lossy compression, such as Opus, MP3, Vorbis, Musepack, AAC, ATRAC and Windows Media Audio Lossy (WMA lossy).
Uncompressed audio formats
The first group is just a simple array of integers: the signal was sampled with a certain sample frequency (see tab From analogue to digital) and each consecutive sample is stored in the array. The differences between the uncompressed formats lie in the way the signal is written (big-endian or little-endian) and in the header (= metadata) stored in the audio file.
This metadata makes it possible for a software program to "know" whether the recording was in mono or stereo, what the sample frequency was, how many bits are used for one sample and how many audio samples there are. Moreover, additional information can be stored, such as the owner, the software/hardware used for the recording and more. These uncompressed audio formats require the most disk space, but they are "fast" because no additional calculation has to be done for reading and writing the files, and they have the highest audio quality.
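For WAV files, this header information can, for example, be read with Python's built-in wave module (the file name is illustrative):

```python
import wave

# Read the header (metadata) of an uncompressed WAV file.
with wave.open("interview.wav", "rb") as wav:
    channels    = wav.getnchannels()   # 1 = mono, 2 = stereo
    sample_rate = wav.getframerate()   # samples per second, e.g. 16000
    sample_size = wav.getsampwidth()   # bytes per sample, e.g. 2 (= 16 bits)
    n_samples   = wav.getnframes()     # number of audio frames in the file

    duration_s = n_samples / sample_rate
    print(channels, sample_rate, sample_size, duration_s)
```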
Lossless Compression
The amount of disk space necessary for your audio files may be an issue. With lossless compression, audio files are written in a "smart" way so they take up less space but do not lose any information. From a lossless compression you can always go back to the uncompressed format and no information will be lost.
The disadvantage is that some computation is needed to open the file (for reading, editing or playing) and to write it back to the hard disk. So for long-term storage, when there is no need to access the files very often, lossless compression is an optimal choice.
Lossy Compression
The sensitivity of the human ear is not linear: we hear the difference between 100 Hz and 110 Hz quite well, but not the difference between 6000 Hz and 6010 Hz. So it is possible to compress the sound with lossy data compression: a data encoding method that uses inexact approximations and partial data discarding to reduce file sizes significantly, typically by a factor of 10, without a huge loss of audio quality. However, the more we compress, the more audible the loss becomes.
For listening to the files, this partial loss of quality is not a big issue: we can still perfectly hear and understand what someone is telling in an interview. For automatic recognition, however, a strong compression may increase the word error rate.
So the best thing to do is: use the original quality (uncompressed or losslessly compressed) for ASR, alignment and (possibly) non-verbal analyses, and use the compressed versions for access via the Internet.
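A minimal sketch of such a workflow, using the third-party pydub library as an example (it relies on ffmpeg being installed; file names and the MP3 bitrate are illustrative):

```python
from pydub import AudioSegment   # third-party library; needs ffmpeg for FLAC/MP3

master = AudioSegment.from_wav("interview_master.wav")

# Lossless copy for storage: can always be converted back to WAV without loss.
master.export("interview_master.flac", format="flac")

# Lossy copy for access via the Internet; keep the master for ASR and alignment.
master.export("interview_web.mp3", format="mp3", bitrate="128k")
```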
Conversion on the web
There are various (good) webservices that convert your audio into other formats.
- Audio Converter (media.io)
New Digital recordings
For new recordings it makes sense to create a recording situation that is optimized for technologies such as ASR, alignment, emotion detection, facial expression analysis and more.
Some small guidelines:
- record each speaker on a separate audio channel via a separate microphone
- record the speech with a high sample frequency and a 4-byte sample value (not 16 kHz / 16-bit mono, but 96 kHz / 32-bit with a channel per speaker)
- use microphones that have a more-or-less fixed distance to the mouth
- use microphones that suppress as much as possible the sound from sources other than the mouth of the speaker
The benefits of this approach are great. Separate channels per speaker make it possible to detect turn-taking automatically, prevent a louder speaker from "overruling" a softer speaking person, and allow the speech to be transcribed even when people are talking at the same time.
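A minimal sketch of how such a multi-channel recording could be split into one mono file per speaker, again using the third-party pydub library as an example (file names are illustrative):

```python
from pydub import AudioSegment   # third-party library; needs ffmpeg for most formats

# One microphone per speaker, recorded on separate channels of one file.
recording = AudioSegment.from_file("interview_multichannel.wav")

# Write one mono file per channel, so each speaker can be processed separately.
for index, channel in enumerate(recording.split_to_mono(), start=1):
    channel.export(f"speaker_{index}.wav", format="wav")
```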
Transcriptions
Transcription is a translation between forms of data, most commonly to convert audio-visual recordings to text in qualitative and quantitative research. It should match the analytic and methodological aims of the research. Whilst transcription is often part of the analysis process, it also enhances the sharing, disclosure and reuse potential of research data.
Full transcription is recommended for data sharing.
If the transcription is done with ASR or forced alignment, each transcribed/spoken word automatically gets a start and end time. This makes it possible to access the AV-files directly at the word level: clicking the selected fragment in the search window may result in playing that fragment aloud.
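As an illustration, word-level timing information could be represented like this (the exact format depends on the ASR or alignment tool used; the words and times below are made up):

```python
# Illustrative only: each word carries its start and end time in seconds.
transcript = [
    {"word": "good",      "start": 0.31, "end": 0.52},
    {"word": "afternoon", "start": 0.52, "end": 1.10},
    {"word": "welcome",   "start": 1.35, "end": 1.80},
]

def words_between(transcript, t0, t1):
    """Return the words spoken between t0 and t1 seconds, e.g. for a clicked fragment."""
    return [w["word"] for w in transcript if w["start"] >= t0 and w["end"] <= t1]

print(words_between(transcript, 0.0, 1.2))   # ['good', 'afternoon']
```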
Separation of content and presentation
Transcripts contain (a lot of) information that can be parsed by computers and humans. Human parsing is robust to small errors, but computer parsing is not.
The content is therefore best written in XML (or JSON) using UTF-8. XML enforces a structured way of storing the data, making it possible to unambiguously parse the transcripts with a computer.
Storing the transcripts in a text-editor format (e.g. docx or pdf) is therefore not recommended. Small, barely noticeable errors may make the transcript unparsable. For example, when a less suitable font is used: Rl and RI look the same in the Helvetica font (but clearly different in the Courier font: Rl and RI).
The same is true for the use of a hard return (RETURN) and a soft return (SHIFT-RETURN). To the human eye they look the same, but not to a computer, so parsing may go wrong.
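A minimal sketch of what writing a transcript fragment as UTF-8 encoded XML could look like, using Python's built-in ElementTree module; the element and attribute names are illustrative, not a prescribed schema:

```python
import xml.etree.ElementTree as ET

# Element and attribute names below are illustrative, not a prescribed schema.
transcript = ET.Element("transcript", interview="INT-2024-001")
utterance = ET.SubElement(transcript, "utterance", speaker="interviewee",
                          start="12.31", end="15.80")
utterance.text = "I was born in Rotterdam in 1943."

# UTF-8 with an XML declaration, so the file can be parsed unambiguously.
ET.ElementTree(transcript).write("interview.xml", encoding="utf-8",
                                 xml_declaration=True)
```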
XSLT
When presenting the transcripts, XSLT-files can be used to generate a human-readable document that
- shows just the information that is desired (for example all information or only the text of the transcript)
- presents the information in the look-and-feel of the institution (font, size, colours, etc.), including logos and standard text.
Finally, if the layout of the transcripts needs to be modified, only one XSLT file needs to be changed (instead of hundreds of Word files).
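A minimal sketch of applying such an XSLT stylesheet with the third-party lxml library (file names are illustrative):

```python
from lxml import etree   # third-party library with XSLT 1.0 support

# File names are illustrative.
stylesheet = etree.XSLT(etree.parse("house_style.xslt"))
transcript = etree.parse("interview.xml")
html_view = stylesheet(transcript)          # apply the XSLT transformation

with open("interview.html", "w", encoding="utf-8") as f:
    f.write(str(html_view))                 # human-readable presentation
```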
Use of transcripts in third-party software
When planning the structure of the transcription template, best practice is to:
- Consider compatibility with the import features of qualitative data analysis software. Which information is needed (a must) or nice to have in that particular analysis package, and which information cannot be used (so does it make sense to collect that information in the transcription documents?). Again, an XSLT file can be used to generate XML files that can be imported into the third-party software. Moreover, different XSLT files can be used to generate different export files for different third-party software (for example, one XSLT for export to AtlasTI, another XSLT for export to MaxQDA).
- Write transcriber instructions or guidelines to obtain consistency in the transcripts, especially when different people make or correct the transcriptions. How to deal with non-verbal, not-understandable or inaudible speech? How to write foreign or dialect words? How to mark sensitive information for later anonymisation?
- Provide a translation, or at least a summary, of each interview in English when the speech is in another language.
- Never blindly trust the transcription results of ASR software (automatic speech recognition). ASR becomes better and better, but the software cannot recognise words that are not in its vocabulary (jargon, foreign and dialect words, acronyms, abbreviations, etc.).
Transcription methods
Transcription methods depend very much upon your theoretical and methodological approach and can vary between disciplines.
- A thematic sociological research project usually requires a denaturalised approach, i.e. most like written language (Bucholtz, 2000), because the focus is on the content of what was said and the themes that emerge from that.
- A project using conversation analysis would use a naturalised approach, i.e. most like speech, whereby a transcriber seeks to capture all the sounds they hear and use a range of symbols to represent particular features of speech in addition to the spoken words; for example representing the length of pauses, laughter, overlapping speech, turn-taking or intonation.
- A psychosocial method transcript may include detailed notes on emotional reactions, physical orientation, body language, use of space, as well as the psycho-dynamics in the relationship between the interviewer and interviewee.
- Some transcribers may try to make a transcript look correct in grammar and punctuation, considerably changing the sense of flow and dynamics of the spoken interaction. Transcription should capture the essence of the spoken word, but need not go as far as the naturalised approach. This kind of transcript, in combination with forced alignment, is often used for the automatic generation of subtitles.
Reference: Bucholtz, M. (2000) The Politics of Transcription. Journal of Pragmatics 32: 1439-1465.
(this text is partly based on the information on the UK Data Service website)
Metadata
Depending on the context in which an interview is created, as part of a collection that is kept in an archive or as research material for an individual researcher, a number of characteristics will have been attributed to the recording; we call these 'metadata'.
In the first case, the archival context, attributing metadata is standard procedure.
In the second case, interviewing to publish a PhD thesis or an article, it depends on the discipline the scholar is familiar with whether this is required and whether re-use is considered an option.
Metadata tell us something about who created the interview, when, and why. If the metadata are created in a systematic way and comply with a standard, the interviews they refer to become searchable and can be processed with digital tools.
The possibilities of digital technology to open up oral history archives and support oral history scholarship can only be fully exploited if both kinds of interview material abide by the principle of metadata. This is why we encourage both kinds of scholars, librarians/archivists as well as historians, anthropologists, linguists and sociologists, to attribute metadata to their interviews and to comply with a standard.
Metadata Schemas
| Schema | Format | Description |
| --- | --- | --- |
| Data Documentation Initiative (DDI) | XML | social science data; mandatory and optional metadata elements for study description, data file description and variable description; codebook version (DDI2 or DDI-C) and lifecycle version (DDI3 or DDI-L) |
| Dublin Core (DC) | XML | basic, generic, discipline-agnostic, web resources; 15 (optional) metadata elements |
| Text Encoding Initiative (TEI) | XML | for mark-up of textual data, e.g. turn takers in speech, typos, formatting text on screen |
| Data Cite | XML, RDF | publishing digital datasets with a persistent identifier (DOI); five mandatory and multiple recommended/optional elements; discipline-agnostic |
| ISO 19115 | XML | geographic information |
| QuDex (Qualitative Data Exchange) | XML | rich file-level description, document coding and annotation and intra-collection relationships; allows identification of data objects such as an interview transcript or audio recording; relationships to other data objects or parts of data; descriptive categories at the object level, e.g. interview characteristics, interview setting; capacity to capture rich annotation of parts of data |
| Common European Research Information Format (CERIF) | XML | records research information about people, projects, outputs (publications, patents, products), funding, events, facilities, equipment |
| Metadata Encoding and Transmission Standard (METS) | XML | encoding descriptive, administrative and structural metadata regarding objects within a digital library |
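As an illustration of how such a schema is used, here is a minimal sketch of a Dublin Core description written with Python's built-in ElementTree module; the wrapper element and the example values are illustrative, only the dc: elements themselves belong to the Dublin Core standard:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"      # Dublin Core element namespace
ET.register_namespace("dc", DC)

# The wrapper element and the example values are illustrative.
record = ET.Element("record")
for element, value in [
    ("title",       "Interview with J. Jansen on post-war reconstruction"),
    ("creator",     "Oral History Project X"),
    ("date",        "2019-05-14"),
    ("language",    "nl"),
    ("description", "Life-story interview, 2 hours, audio with transcript"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

ET.ElementTree(record).write("interview_dc.xml", encoding="utf-8",
                             xml_declaration=True)
```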
- Metadata preparation/markup guidelines QualiBank
- QualiBank Processing Procedures
- Metadata preparation/markup procedures QualiBank
- Qualitative data collection ingest processing procedures