ASR (Automatic Speech Recognition) is the transformation of spoken content into its textual representation. Or, put simply: Speech-to-Text.


A wide range of different technologies is, or can be, used in the various stages of the “Oral History workflow”.

The workflow can be divided into five stages:

  • Recording the interview
  • Transcribing the spoken content
  • Metadata addition
  • Analysing the interview
  • Disclosing the interview

Recording

Technology for recording the interview is mentioned only briefly in this document.

Transcribing

The best known, and probably most wanted, technology is “Automatic Speech Recognition” or ASR: writing down every word spoken in an OH interview. ASR has become remarkably good, especially since the introduction of Deep Neural Networks (DNNs), but a 100% correct, error-free transcription will probably remain a dream. Where ASR does not work well enough, human transcription is the alternative. Finally, Forced Alignment yields the start and end time of each spoken word. By hand and/or with speaker diarization, each part of the interview can then be attributed to a single speaker (provided the speakers do not talk at the same time).
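
To make this concrete, the sketch below shows how a present-day DNN-based recogniser can produce a time-stamped transcription. It assumes the open-source Whisper toolkit (the “openai-whisper” Python package) and a hypothetical recording called “interview.wav”; any ASR system that returns segment or word timings would serve the same purpose.

  # Minimal ASR sketch (assumption: the open-source "openai-whisper" package is installed).
  import whisper

  model = whisper.load_model("base")                 # small, general-purpose model
  result = model.transcribe("interview.wav",         # hypothetical interview recording
                            word_timestamps=True)    # also ask for per-word timings

  # Print each recognised segment with its start and end time in seconds.
  for segment in result["segments"]:
      print(f"[{segment['start']:7.2f} - {segment['end']:7.2f}] {segment['text'].strip()}")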

Metadata addition

Once the transcriptions are made, they can be used to write summaries: of the full interview, of fixed time intervals (e.g. every 5 or 10 minutes), and/or of “homogeneous parts” (chapters) of the interview. Moreover, themes that recur throughout the interview collection can be identified.
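
As a sketch of how such fixed-interval summaries could be prepared, the snippet below groups time-stamped transcript segments into 10-minute windows, so that each window can be summarised separately. The segment format follows the ASR sketch above; the summarising step itself (human or automatic) is deliberately left open.

  from collections import defaultdict

  def chunk_by_interval(segments, interval_s=600):
      """Group timed segments into fixed windows of interval_s seconds."""
      windows = defaultdict(list)
      for seg in segments:
          windows[int(seg["start"] // interval_s)].append(seg["text"].strip())
      return {idx: " ".join(texts) for idx, texts in sorted(windows.items())}

  # Hypothetical use: one summary per 10-minute window.
  # for idx, text in chunk_by_interval(result["segments"]).items():
  #     summary = summarise(text)   # "summarise" is a placeholder, not a real function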

Analysing

Once the transcriptions are done, other technologies can be applied. All known text-analysis techniques can be applied to the transcriptions; one may think of text mining, concordances across different transcriptions, topic clustering, named entity recognition and much more.
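
As one illustration, the sketch below clusters a (hypothetical) collection of transcripts into topics using TF-IDF features and k-means, assuming scikit-learn is available; in practice the transcripts would be full interview texts rather than the short placeholders shown here.

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.cluster import KMeans

  # Hypothetical, heavily shortened transcripts.
  transcripts = [
      "first interview about wartime memories and the liberation",
      "second interview about factory work and the closing of the plant",
      "third interview about evacuation and wartime shortages",
  ]

  vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
  X = vectorizer.fit_transform(transcripts)          # document-term matrix

  kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
  for doc_id, cluster in enumerate(kmeans.labels_):
      print(f"transcript {doc_id} -> topic cluster {cluster}")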

Because the audio signal is available and the start and end time of every word are known, dialogue analysis can be done as well. If the recording is a video, image analysis is possible too (but that topic will not be discussed here).
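
A very simple form of dialogue analysis is sketched below: given word-level timings and the speaker labels that forced alignment and diarization have produced, it computes the total speaking time and the number of turns per speaker. The data layout is an assumption made for the example.

  # Hypothetical word-level output: one entry per aligned word.
  words = [
      {"speaker": "interviewer", "start": 0.00, "end": 0.45},
      {"speaker": "interviewer", "start": 0.50, "end": 0.90},
      {"speaker": "narrator",    "start": 1.20, "end": 1.80},
  ]

  speaking_time, turns, previous = {}, {}, None
  for w in words:
      spk = w["speaker"]
      speaking_time[spk] = speaking_time.get(spk, 0.0) + (w["end"] - w["start"])
      if spk != previous:                    # a new turn starts when the speaker changes
          turns[spk] = turns.get(spk, 0) + 1
          previous = spk

  for spk, seconds in speaking_time.items():
      print(f"{spk}: {seconds:.1f} s of speech, {turns[spk]} turn(s)")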

Named Entity Recognition (NER) can be used to “mark” all named entities (places, persons, organisations, companies, etc.) in the interview. These entities can then be linked to information outside the interview (e.g. Wikipedia), and places can be visualised with an icon on Google Maps or with so-called heat maps in Google Fusion Tables.
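
The sketch below illustrates the NER step with spaCy, assuming its small English model (“en_core_web_sm”) has been downloaded; the Wikipedia link is built naively for illustration only, whereas real entity linking would need a disambiguation step.

  import spacy

  nlp = spacy.load("en_core_web_sm")
  doc = nlp("In 1975 Maria moved from Rotterdam to work for Philips in Eindhoven.")

  # Print every recognised entity with its type and a naive Wikipedia lookup link.
  for ent in doc.ents:
      wiki = "https://en.wikipedia.org/wiki/" + ent.text.replace(" ", "_")
      print(f"{ent.text:<12} {ent.label_:<8} {wiki}")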

For the actual analysis of the interview content, many scholars use (commercial) software such as LIWC, ATLAS.ti, NVivo, MAXQDA and others. These software packages may use the transcriptions, but their “output” is not open: they all use proprietary (i.e. closed) data formats that can only be read by their own software.