Forced Alignment

Forced Alignment is a kind of Automatic Speech Recognition, but it differs in one major respect. Rather than being given a set of possible words to recognise, the recognition engine is given an exact transcription of what is spoken in the speech data. The system then aligns the transcription with the speech data, identifying which time segments in the speech data correspond best to particular words in the transcription.
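The idea of aligning a known transcription to audio can be illustrated with a small dynamic-programming sketch. It is not the method of any particular aligner: it assumes a hypothetical table scores[t][u] that says how well audio frame t fits transcript unit u (a word or phoneme), and it finds the best way to cut the frames into consecutive segments, one per unit, in the order given by the transcription.

```python
from typing import List, Tuple


def force_align(scores: List[List[float]], n_units: int) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) for each transcript unit.

    scores[t][u] is an assumed log-likelihood that frame t belongs to unit u.
    Units must occur in transcript order and each unit gets at least one frame.
    """
    n_frames = len(scores)
    if n_units == 0 or n_frames < n_units:
        raise ValueError("need at least one frame per unit")

    neg_inf = float("-inf")
    # best[t][u]: best total score for frames 0..t with frame t assigned to unit u
    best = [[neg_inf] * n_units for _ in range(n_frames)]
    # entered[t][u]: True if unit u starts at frame t on the best path
    entered = [[False] * n_units for _ in range(n_frames)]
    best[0][0] = scores[0][0]

    for t in range(1, n_frames):
        for u in range(n_units):
            stay = best[t - 1][u]                               # same unit as previous frame
            advance = best[t - 1][u - 1] if u > 0 else neg_inf  # move to the next unit
            if advance > stay:
                best[t][u] = advance + scores[t][u]
                entered[t][u] = True
            else:
                best[t][u] = stay + scores[t][u]

    # Backtrack from the last frame of the last unit to recover the boundaries.
    starts = [0] * n_units
    u = n_units - 1
    for t in range(n_frames - 1, 0, -1):
        if entered[t][u]:
            starts[u] = t
            u -= 1
    ends = starts[1:] + [n_frames]
    return list(zip(starts, ends))


# Toy input: 6 frames, 3 units; 0.0 means "fits well", -5.0 means "fits badly".
scores = [
    [0.0, -5.0, -5.0],
    [0.0, -5.0, -5.0],
    [-5.0, 0.0, -5.0],
    [-5.0, 0.0, -5.0],
    [-5.0, 0.0, -5.0],
    [-5.0, -5.0, 0.0],
]
print(force_align(scores, 3))  # [(0, 2), (2, 5), (5, 6)]
```

In a real system the scores come from an acoustic model and the units are the phonemes of the transcription; the frame boundaries are then converted to times.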

Below is an explanation of what it is and how it works. For a description of "how to align your text with audio with the help of software (i.e. WebMAUS)", see /software/forced-alignment.

[Image: forced alignment]

Difference between spoken and written text

Ideally, the written text to align with the audio is an exact copy of what was said (see image above). But often, transcriptions are a corrected, interpreted version of what was said.
If someone says "If I remember well, I was, eh I mean, we went to the store, eh to the grocery store", it will often be transcribed as "If I remember well, we went to the grocery store". All hesitations, self-corrections and repetitions of the speaker are skipped. The result is a grammatically correct version of what was meant, but it differs from what was said. Forced Alignment (FA) can handle this kind of deviation between written and spoken content as long as the difference is not too big.
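To see how much a cleaned-up transcription leaves out, the two word sequences can be compared directly. The sketch below uses Python's standard difflib module on the example sentence; it only compares text and is not part of any aligner. The function name skipped_spoken_words is made up for this illustration.

```python
from difflib import SequenceMatcher
from typing import List


def skipped_spoken_words(spoken: List[str], written: List[str]) -> List[List[str]]:
    """Return the runs of spoken words that have no counterpart in the written text."""
    matcher = SequenceMatcher(a=spoken, b=written, autojunk=False)
    runs = []
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        # "delete" and "replace" both mean: spoken words not kept as-is in the text
        if tag in ("delete", "replace"):
            runs.append(spoken[a_start:a_end])
    return runs


spoken = ("if i remember well i was eh i mean we went to the store "
          "eh to the grocery store").split()
written = "if i remember well we went to the grocery store".split()

for run in skipped_spoken_words(spoken, written):
    print(" ".join(run))
# i was eh i mean
# store eh to the
```

Here 9 of the 19 spoken words are missing from the transcription, spread over two short runs. Short gaps like these are the kind of deviation the text above describes as manageable.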

It becomes more difficult if parts of the written text have been reordered, for example:

[Image: spoken versus written version of the sentence]

If the spoken "I want a beer" part is correctly matched to the written version, the speech before "I want a beer" has no text, and the written part "because I'm thirsty" has no time. For searching this isn't a big problem, but if the aligned text will be used for subtitling, some manual correction will be necessary.
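The effect of reordering can be reproduced with a small sketch. It assumes, as a guess at the lost example, that the speaker said "because I'm thirsty, I want a beer" while the text reads "I want a beer because I'm thirsty". It also assumes the times of the spoken words are already known (the numbers are invented) and uses difflib as a stand-in for a real aligner, which like this sketch can only match words in order.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

TimedWord = Tuple[str, float, float]                          # (word, start_s, end_s)
MaybeTimed = Tuple[str, Optional[float], Optional[float]]


def transfer_times(spoken: List[TimedWord],
                   written: List[str]) -> Tuple[List[MaybeTimed], List[TimedWord]]:
    """Copy times from spoken words to the written words they match, in order.

    Returns the written words (time is None when unmatched) and the spoken
    words that ended up without any text.
    """
    spoken_words = [word for word, _, _ in spoken]
    matcher = SequenceMatcher(a=spoken_words, b=written, autojunk=False)

    timed: List[MaybeTimed] = [(word, None, None) for word in written]
    matched_spoken = set()
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = spoken[block.a + k]
            timed[block.b + k] = (written[block.b + k], start, end)
            matched_spoken.add(block.a + k)

    untexted = [w for i, w in enumerate(spoken) if i not in matched_spoken]
    return timed, untexted


spoken = [("because", 0.0, 0.4), ("i'm", 0.4, 0.6), ("thirsty", 0.6, 1.1),
          ("i", 1.3, 1.4), ("want", 1.4, 1.7), ("a", 1.7, 1.8), ("beer", 1.8, 2.2)]
written = "i want a beer because i'm thirsty".split()

timed, untexted = transfer_times(spoken, written)
print([w for w, start, _ in timed if start is None])  # ['because', "i'm", 'thirsty']
print([w for w, _, _ in untexted])                    # ['because', "i'm", 'thirsty']
```

The same three words show up twice: once as text without a time and once as speech without text. A subtitle editor would have to move them by hand.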


Unknown words and Phonetic Transcriptions

A problem may arise if the text (and audio) contains foreign words, abbreviations and (large) numbers. In order to align, the written text is transformed into phonemes according to the language setting. But how to deal with words that are foreign to the target language? Suppose you have a written phrase like "Mrs Uytendenboogaard and Mr. Chukwuemeka have a boat trip on the Bhagirathi River". This phrase combines a Dutch and a Nigerian family name with the name of an Indian river. What will be the correct pronunciation?

The first step in each FA process is the transformation (= phonetic transcription) from words to phonemes. Then, these phonemes are matched with the spoken content. This means that the phonetic transcription is crucial, but this transcription is language-dependent! E.g. the word boat will be transcribed as /b o: t/ when you use the English G2P (= Grapheme-to-Phoneme) converter. But if you use the Dutch G2P, the phonetic transcription will be /b o: A t/.
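The language dependence can be made concrete with a toy lookup. The sketch below is not a real G2P converter: it is a two-entry dictionary, keyed by language, that holds only the two transcriptions of "boat" given above. The names LEXICONS and g2p are invented for this illustration.

```python
from typing import Dict, List, Optional

# Toy pronunciation lexicons; the entries are the two transcriptions of "boat"
# quoted in the text. A real G2P uses large lexicons plus letter-to-sound rules.
LEXICONS: Dict[str, Dict[str, List[str]]] = {
    "en": {"boat": ["b", "o:", "t"]},
    "nl": {"boat": ["b", "o:", "A", "t"]},
}


def g2p(word: str, language: str) -> Optional[List[str]]:
    """Return the phonemes of `word` in `language`, or None if the word is unknown."""
    return LEXICONS[language].get(word.lower())


for language in ("en", "nl"):
    phonemes = g2p("boat", language)
    print(language, "/" + " ".join(phonemes) + "/")
# en /b o: t/
# nl /b o: A t/

print(g2p("Bhagirathi", "en"))  # None: this toy lexicon has no entry for it
```

Because the aligner matches phonemes, not letters, choosing the wrong language setting means it looks for sounds that were never spoken.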

If we pass the example sentence through a G2P, the result is:

TXT  Mrs Uytendenboogaard and Mr. Chukwuemeka have a boat trip on the Bhagirathi River
UK   mIs@z uytendenboogaard {nd mIst@r chukwuemeka h{v @ bot trIp An D@ bhagirathi rIv@r
NL   Em Er Es Ytd@nboxart And Em Er XukwuEm@ka hav@ a boAt trIp On d@ bagirati riv@r

The transcription was made with an automatic G2P for English and for Dutch. The words marked in blue in the original (those the English G2P left untranscribed: uytendenboogaard, chukwuemeka and bhagirathi) mean: "No idea how to pronounce it". The Dutch G2P decided to spell out words it did not know (so Mr. becomes Em Er). The difference between these two transcriptions is huge, and the Forced Alignments based on the two transcriptions may differ significantly as well.

This small example shows that the phonetic transcription, crucial for correct Forced Alignment, is less straightforward than often expected. Abbreviations and unknown or foreign words may ruin the alignment. A good FA system must therefore interact with the user about the phonetic transcription, presenting the user with its best guess for initially unknown words. The user may enter the phonetic transcription of the words (using the phonetic alphabet) or a "sounds-as" word ("Bhagirathi" sounds as "bakirati"). Once the initially unknown word has been transcribed by the user, it may be entered into a phonetic dictionary for further use.
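The interaction described above could be organised roughly as follows. This is a minimal sketch, not the design of any existing FA system: the class PhoneticLexicon, its methods and the helper spell_as_phonemes are all hypothetical. The base entries are copied from the English transcription of the example sentence, and the "sounds-as" conversion is a crude stand-in that treats every letter as one phoneme.

```python
import re
from typing import Callable, Dict, List, Optional


def spell_as_phonemes(word: str) -> List[str]:
    """Crude stand-in for a real G2P: treat every letter as one phoneme."""
    return list(word.lower())


class PhoneticLexicon:
    """Base lexicon plus user-supplied pronunciations that are kept for reuse."""

    def __init__(self, base: Dict[str, List[str]]) -> None:
        self.base = dict(base)
        self.user: Dict[str, List[str]] = {}

    def lookup(self, word: str) -> Optional[List[str]]:
        key = word.lower()
        # User entries take priority over the base lexicon.
        return self.user.get(key, self.base.get(key))

    def unknown_words(self, text: str) -> List[str]:
        """Words in `text` that still need a pronunciation from the user."""
        words = re.findall(r"[a-z']+", text.lower())
        return [w for w in words if self.lookup(w) is None]

    def add_pronunciation(self, word: str, phonemes: List[str]) -> None:
        """The user types the phonetic transcription directly."""
        self.user[word.lower()] = list(phonemes)

    def add_sounds_as(self, word: str, sounds_as: str,
                      g2p: Callable[[str], List[str]] = spell_as_phonemes) -> None:
        """The user gives a respelling; a G2P turns it into phonemes."""
        self.user[word.lower()] = g2p(sounds_as)


lexicon = PhoneticLexicon({
    "a": ["@"], "boat": ["b", "o", "t"], "trip": ["t", "r", "I", "p"],
    "on": ["A", "n"], "the": ["D", "@"], "river": ["r", "I", "v", "@", "r"],
})
text = "a boat trip on the Bhagirathi River"

print(lexicon.unknown_words(text))       # ['bhagirathi']
lexicon.add_sounds_as("Bhagirathi", "bakirati")
print(lexicon.unknown_words(text))       # []
print(lexicon.lookup("Bhagirathi"))      # ['b', 'a', 'k', 'i', 'r', 'a', 't', 'i']
```

Keeping user entries in a separate dictionary means they can be saved and loaded for the next recording without touching the base lexicon.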

Audio quality

The quality of the audio-signal (i.e. the speech sound) may influence the accuracy of the alignment. The better the audio (less or no background noise)