Transcription chain

A transcription chain is a set of concatenated "steps" one has to perform in order to go from a "recorded interview" to a findable, accessible and viewable digital AV-document with relevant metadata on the Internet.

Below, we explain the various steps one may take in a transcription chain. It has to be stressed that there is no "perfect" solution: the transcription chain is a set of consecutive steps one may take to reach the final goal, a findable, accessible and viewable digital AV-document.

If the audio quality is good and a 100% verbatim transcription of the speech is not necessary, automatic speech recognition without human checking and correction may be enough. On the other hand, if the audio quality is poor, people speak in dialect, and a verbatim transcription is necessary for a thorough dialogue analysis, human transcription followed by forced alignment by the computer may be necessary.

From an interview to a digital repository

Basically, one may divide interviews into existing (already recorded) and new (to be made), and into analogue and digital. Analogue interviews can only be accessed by going to the place where the physical object is stored and playing it on a device similar to the one it was recorded on. So, to make them accessible via the Internet, analogue recordings need to be digitized. Moreover, it is not very likely that many new interviews will be recorded on an analogue device.

 So in the process "From interview to a digital recording suitable for recognition / alignment" we have three possibilities.

Digitizing Analogue recordings

A simple version of the digitization process is:

Play the analogue recording and send the analogue output via an AD-converter to a computer where it will be stored as a digital file.

This is true, but to get the best quality one needs to consider a number of issues.

Adjusting the playing device

The quality of the analogue playing device must be as good as possible. Using “just a tape recorder or cassette player” may introduce all kinds of (disturbing) artefacts in the digital version of the recording. A nice overview of good practice is given by Kees Grijpink of the Meertens Institute in Amsterdam.

During a visit to the Meertens Institute, Kees showed how one needs to open a compact cassette, clean it, remove all dust, clean the tape guides and make them run smoothly again, clean the tape heads with alcohol, restore the pressure pads (they dry out and become less flexible) and, very importantly, adjust the azimuth of the tape head (for a nice video instruction see here).

Modern software can (partly) correct for azimuth deviations, but it is preferred to do this while digitizing the recording.

More or less the same needs to be done with reel-to-reel tapes and tape recorders.


The analogue signal is digitized by a so-called AD-converter (analogue-to-digital converter). Basically, the analogue signal is sampled at fixed, very short intervals, and the value of the signal at each sample point is stored as a number in a digital file. The smaller the interval between two sample points, the better the resulting quality, but the more disk space a recording will need.

The number of samples per second is called the sampling frequency and is expressed in Hertz (Hz). A sampling frequency of 16,000 Hz (= 16 kHz) means that the sound wave is measured 16,000 times per second.

Human Hearing and Voice

But what is an ideal sampling frequency for spoken interviews, for the human voice?

  • The range of human hearing is about 20 Hz to 20 kHz (depending on the age of the listener) and is most sensitive between 2 and 4 kHz: the domain of human speech.
  • The normal human voice ranges between 80 Hz and 8 kHz.


An important criterion is the so-called Nyquist sampling criterion. Without going into too much detail (see the link for more information), one may say that if the maximum frequency in a sound wave is N (i.e. the sound wave oscillates N times per second), the number of samples per second one needs to record is 2×N. The maximum range of the human voice (in speech) is ±8,000 Hz, so the required sampling frequency will be around 16,000 Hz.
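The criterion can be expressed in a few lines of code (a minimal sketch; the 8 kHz upper bound for speech is the voice range quoted above):

```python
# Nyquist criterion: to capture a signal whose highest frequency component
# is f_max, one must sample at a rate of at least 2 * f_max.

def min_sampling_rate(f_max_hz: float) -> float:
    """Return the minimum sampling rate (Hz) required by the Nyquist criterion."""
    return 2 * f_max_hz

# The upper range of the human voice in speech is roughly 8 kHz,
# so about 16 kHz suffices for spoken interviews.
print(min_sampling_rate(8000))  # 16000
```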

Higher and Higher?

For years, 16 kHz with 2-byte (= 16-bit) samples was considered the optimal sampling frequency for recording the human voice. Higher sampling frequencies will result in more data, but not in a higher-quality recording of the human voice.

Nearly all ASR-engines require a sampling frequency of 16 kHz. However, because memory is cheap and plentifully available, and a lot of research is being done on the "improvement" of sound quality, it may be wise to follow the advice of the European Broadcasting Union and use a much higher sampling rate: 96 kHz with a 4-byte (= 32-bit) value.

This will increase the amount of disk space by a factor of 12. So do it if you have enough disk space available and store those files in a kind of dark archive. For "daily" use (sending the recordings to colleagues, listening to the speech and processing the recording with an ASR-engine), use the 16 kHz/16-bit version.
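The factor of 12 follows directly from the two formats (a back-of-the-envelope sketch; the sizes are for raw, uncompressed mono PCM samples only, ignoring file headers):

```python
# Storage comparison of the 16 kHz/16-bit "daily use" format and the
# 96 kHz/32-bit archival format recommended above (mono, uncompressed PCM).

def pcm_bytes(seconds: float, sample_rate_hz: int, bytes_per_sample: int,
              channels: int = 1) -> int:
    """Raw PCM size in bytes for an uncompressed recording."""
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)

hour = 3600
daily = pcm_bytes(hour, 16_000, 2)    # 16 kHz, 2 bytes per sample
archive = pcm_bytes(hour, 96_000, 4)  # 96 kHz, 4 bytes per sample

print(daily // 2**20, "MiB")   # 109 MiB per hour of interview
print(archive // 2**20, "MiB") # 1318 MiB per hour of interview
print(archive // daily)        # 12
```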

Mono or Stereo?

If the original recording was in mono, use mono. Often the default setting of recording software is 44.1 kHz stereo.
One may argue that the high sampling frequency may be useful in the near future, but a stereo file made from a mono recording (twice the same signal, on the left and the right channel) does not make sense.

When the original recording was made with two microphones (each on a separate channel) and there is a clear difference between the left and right channel, it is wise to convert it to a digital stereo file.

For new recordings this is different: stereo, or even better, each speaker recorded with his/her own microphone on a separate channel, would be ideal.
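Detecting such a "fake" stereo file can be sketched as follows (a minimal illustration on plain lists of samples; the function names are our own, and in practice the samples would first be read from e.g. a WAV file):

```python
# A stereo file whose left and right channels carry the same signal adds
# no information and can safely be down-mixed to mono.

def is_pseudo_stereo(left, right, tolerance: int = 0) -> bool:
    """True if the two channels are sample-for-sample identical (within tolerance)."""
    return len(left) == len(right) and all(
        abs(l - r) <= tolerance for l, r in zip(left, right)
    )

def downmix_to_mono(left, right):
    """Average the two channels into a single mono channel."""
    return [(l + r) // 2 for l, r in zip(left, right)]

# A mono recording accidentally stored as stereo: both channels identical.
left = [0, 120, -340, 560]
right = [0, 120, -340, 560]
print(is_pseudo_stereo(left, right))  # True
print(downmix_to_mono(left, right))   # [0, 120, -340, 560]
```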

Soundcards & Software

When converting the analogue signal into its digital equivalent, the easiest and cheapest solution is to plug the line-out of your tape/cassette recorder into the line-in of your computer's sound card.




This may give "acceptable" results, but the quality of the digital recording heavily depends on:

  1. the quality of the sound card in your computer
  2. the software program used for the recording

Sound cards

If you plan to digitize a lot of analogue recordings, it may be wise to invest a bit in a good sound card. The best and most practical solution is an external USB sound card. The advantages of such a device are that:

  1. it is portable and you can use it with different computers
  2. you’ll typically get higher-quality audio processing than with your PC’s built-in sound card. The best of these units offer 24-bit/96 kHz digital audio.


There are many free and paid software packages that can be used for the recording. A nice overview of some free tools can be found here.

If you really plan to do massive digitization of your analogue collection, it may be wise to buy a (semi-)professional package with which you can do things such as azimuth correction, noise cancelling, filtering and more.




Transcription is a translation between forms of data, most commonly to convert audio-visual recordings to text in qualitative and quantitative research. It should match the analytic and methodological aims of the research. Whilst transcription is often part of the analysis process, it also enhances the sharing, disclosure and reuse potential of research data.
Full transcription is recommended for data sharing.

Screenshot of the manual correction (in SubtitleEdit) of a transcription that was generated with the (Vocapia) ASR-engine.

If the transcription is done with ASR or forced alignment, each transcribed/spoken word automatically gets a start and end time. This makes it possible to access the AV-files directly at the word level: clicking a fragment in the search window may result in that fragment being played aloud.
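The idea of word-level access can be sketched as follows (the transcript structure below is a hypothetical example, not a fixed format):

```python
# ASR / forced alignment gives each word a start and end time, so a search
# hit can jump straight to the matching fragment of the recording.

transcript = [
    {"word": "the",       "start": 0.00, "end": 0.21},
    {"word": "interview", "start": 0.21, "end": 0.88},
    {"word": "was",       "start": 0.88, "end": 1.02},
    {"word": "recorded",  "start": 1.02, "end": 1.55},
]

def find_word(transcript, query):
    """Return the (start, end) times of the first occurrence of a word."""
    for entry in transcript:
        if entry["word"].lower() == query.lower():
            return entry["start"], entry["end"]
    return None

print(find_word(transcript, "interview"))  # (0.21, 0.88)
```

A media player can then be instructed to start playback at the returned start time.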

Separation of content and presentation

Transcripts contain (a lot of) information that can be parsed by computers and humans. Human parsing is robust against small errors, but computer parsing is not.
The content is therefore best written in XML (or JSON) using UTF-8. XML enforces a structured way of storing the data, making it possible to parse the transcripts unambiguously with a computer.

Storing the transcripts in a text-editor format (e.g. docx or pdf) is therefore not recommended. Small, hardly noticeable errors may make the transcript unparsable. For example, when a less suitable font is used: Rl and RI look the same in the Helvetica font (but clearly different in the Courier font: Rl and RI).
The same is true for the use of a hard return (RETURN) versus a soft return (SHIFT-RETURN). To the human eye they look the same, but not to a computer, so parsing may go wrong.
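A minimal sketch of such a structured, UTF-8 encoded transcript, built and parsed with Python's standard library (the element and attribute names are illustrative, not an existing schema):

```python
# Build a tiny transcript as XML, serialize it as UTF-8, and parse it back.
import xml.etree.ElementTree as ET

root = ET.Element("transcript", lang="en")
turn = ET.SubElement(root, "turn", speaker="interviewee",
                     start="0.21", end="0.88")
turn.text = "interview"

xml_bytes = ET.tostring(root, encoding="utf-8")
print(xml_bytes.decode("utf-8"))

# Because the structure is explicit, a computer can parse it unambiguously:
parsed = ET.fromstring(xml_bytes)
print(parsed.find("turn").get("speaker"))  # interviewee
```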


3 XSLT-files (left) for export to third-party software and 3 XSLT-files (right) for reading by humans


When presenting the transcripts, XSLT-files can be used to generate a human-readable document that

  1. shows just the information that is desired (for example all information or only the text of the transcript) 
  2. presents the information in the look-and-feel of the institution (font, size, colours, etc.), including logos and standard text.

Finally, if the layout of the transcripts needs to be modified, only one XSLT-file needs to be changed (instead of hundreds of Word files).
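A minimal sketch of what such an XSLT-file could look like (the element names transcript and turn and the speaker attribute are hypothetical, not a standard schema); it renders only the spoken text as a simple HTML page:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical minimal stylesheet: shows only the spoken text of a
     <transcript> of <turn> elements, one paragraph per turn. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/transcript">
    <html>
      <body>
        <xsl:for-each select="turn">
          <p>
            <b><xsl:value-of select="@speaker"/>: </b>
            <xsl:value-of select="."/>
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

A different stylesheet over the same transcript files could, for instance, also show the timing information, without touching the transcripts themselves.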

Use of transcripts in third-party software

When planning the structure of the transcription template, best practice is to:

  • Consider compatibility with the import features of qualitative data analysis software. Which information is needed (a must) or nice-to-have in that particular analysis software package, and which information cannot be used (so does it make sense to collect that info in the transcription documents?). Again, an XSLT-file can be used to generate XML-files that can be imported into the third-party software.
    Moreover, different XSLT-files can be used to generate different export files for different third-party software (for example one XSLT for export to AtlasTI, another XSLT for export to MaxQDA).
  • Write transcriber instructions or guidelines to get consistency in the transcripts, especially when different people make or correct the transcriptions. How to deal with non-verbal, unintelligible or inaudible speech? How to write foreign or dialect words? How to mark sensitive information for later anonymisation?
  • Provide a translation, or at least a summary, of each interview in English when the speech is in another language.
  • Never trust the transcription results of ASR-software (automatic speech recognition) blindly. ASR is getting better and better, but the software cannot recognise words that are not in its vocabulary (jargon, foreign and dialect words, acronyms, abbreviations, etc.).

Transcription methods

Transcription methods depend very much upon your theoretical and methodological approach, and can vary between disciplines.

  • A thematic sociological research project usually requires a denaturalised approach, i.e. most like written language (Bucholtz, 2000), because the focus is on the content of what was said and the themes that emerge from that.
  • A project using conversation analysis would use a naturalised approach, i.e. most like speech, whereby a transcriber seeks to capture all the sounds they hear and uses a range of symbols to represent particular features of speech in addition to the spoken words; for example the length of pauses, laughter, overlapping speech, turn-taking or intonation.
  • A psycho-social method transcript may include detailed notes on emotional reactions, physical orientation, body language, use of space, as well as the psycho-dynamics in the relationship between the interviewer and interviewee.
  • Some transcribers may try to make a transcript look correct in grammar and punctuation, considerably changing the sense of flow and dynamics of the spoken interaction. Transcription should capture the essence of the spoken word, but need not go as far as the naturalised approach. This kind of transcript is, in combination with forced alignment, often used for the automatic generation of subtitles.

Reference: Bucholtz, M. (2000) The Politics of Transcription. Journal of Pragmatics 32: 1439-1465.

(this text is partly based on the information on the UK Data Service website)


Depending on the context in which an interview is created, as part of a collection that is kept in an archive or as research material for an individual researcher, a number of characteristics will have been attributed to the recording; we call these 'metadata'.

In the first case, the archival context, attributing metadata is standard procedure.
In the second case, interviewing in order to publish a PhD thesis or an article, whether this is required, and whether re-use is considered an option, depends on the discipline the scholar is familiar with.

Metadata tell us who created the interview, when, and why. If the metadata are created in a systematic way and abide by a standard, the interviews they refer to become searchable and can be processed with digital tools.
The possibilities of digital technology to open up oral history archives and support oral history scholarship can only be fully exploited if both kinds of interview material abide by the principle of metadata. This is why we encourage both kinds of scholars, librarians/archivists as well as historians, anthropologists, linguists and sociologists, to attribute metadata to their interviews and to abide by a standard.

Metadata Schemas


  • Data Documentation Initiative (DDI), XML: social science data; mandatory and optional metadata elements for study description, data file description and variable description; a codebook version (DDI2 or DDI-C) and a lifecycle version (DDI3 or DDI-L)
  • Dublin Core (DC), XML: basic, generic, discipline-agnostic; for web resources; 15 (optional) metadata elements
  • Text Encoding Initiative (TEI), XML: for mark-up of textual data, e.g. turn takers in speech, typos, formatting of text on screen
  • DataCite, XML/RDF: publishing digital datasets with a persistent identifier (DOI); five mandatory and multiple recommended/optional elements; discipline-agnostic
  • ISO 19115, XML: geographic information
  • QuDex (Qualitative Data Exchange), XML: rich file-level description, document coding and annotation, and intra-collection relationships. Allows identification of data objects such as an interview transcript or audio recording; relationships to other data objects or parts of data; descriptive categories at the object level (e.g. interview characteristics, interview setting); and rich annotation of parts of data
  • Common European Research Information Format (CERIF), XML: records research information about people, projects, outputs (publications, patents, products), funding, events, facilities and equipment
  • Metadata Encoding and Transmission Standard (METS), XML: encoding descriptive, administrative and structural metadata regarding objects within a digital library
  • QualiBank Processing Procedures: metadata preparation/markup guidelines for QualiBank
  • Qualitative data collection ingest processing procedures: metadata preparation/markup procedures for QualiBank
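As an illustration of the simplest of these schemas, a minimal Dublin Core description of an interview could be generated as follows (a sketch; all field values are invented, and the element set is reduced to five of the fifteen DC elements):

```python
# Build a minimal Dublin Core record for an interview recording,
# using the standard DC element namespace.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for element, value in [
    ("title",   "Oral history interview, 1998"),  # invented example values
    ("creator", "Example Archive"),
    ("date",    "1998-05-12"),
    ("type",    "Sound"),
    ("format",  "audio/wav"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```

Because such records follow a standard, any DC-aware catalogue or harvester can index them without custom parsing.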