G2P, or Grapheme-to-Phoneme conversion, is a technology that, for a selected language, converts the orthographic form of a word into a phonetic representation of that word. In other words, it tells you how the word will probably sound when spoken. The International Phonetic Alphabet (IPA), created and maintained by the International Phonetic Association, provides symbols for the speech sounds of the world's languages.
The IPA symbols are part of the Unicode standard, so they can be used in any tool or word processor that supports a Unicode encoding such as UTF-8. For printing, the use of a phonetic font such as Doulos SIL is strongly advised. However, because standard keyboards do not feature IPA symbols, special input methods have to be used, e.g. graphical input palettes or key sequences. For easy typing and ASCII compatibility, the Speech Assessment Methods Phonetic Alphabet (SAMPA) was created. SAMPA maps standard ASCII characters to IPA symbols.
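The ASCII-to-IPA mapping can be illustrated with a minimal Python sketch. The table below is deliberately partial: it covers only a handful of well-known SAMPA symbols, and the function name `sampa_to_ipa` is an assumption made for this illustration. It also assumes that symbols are separated by spaces and that "-" marks a syllable boundary, as in the transcriptions used in this text. Stress marks are not handled.

```python
# Partial SAMPA -> IPA table; a real converter needs the full,
# language-specific SAMPA inventory.
SAMPA_TO_IPA = {
    "@": "ə", "A": "ɑ", "E": "ɛ", "I": "ɪ", "O": "ɔ",
    "C": "ç", "N": "ŋ", "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð",
    ":": "ː",   # length mark
    "-": ".",   # syllable boundary (assumed convention of this text)
}


def sampa_to_ipa(transcription: str) -> str:
    """Convert a space-separated SAMPA string to IPA, character by character.

    Characters without a table entry (e.g. plain "m", "s", "t") are
    passed through unchanged, since SAMPA reuses them from IPA.
    """
    tokens = transcription.split()
    return "".join(
        "".join(SAMPA_TO_IPA.get(ch, ch) for ch in token) for token in tokens
    )


print(sampa_to_ipa("m I s - t @ r"))  # mɪs.tər
```

Because the IPA output is ordinary Unicode text, it can be stored and exchanged like any other UTF-8 string.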
Language dependency
The pronunciation of a word depends, of course, on the language. The same written word is pronounced differently in different languages.
For example, the capital of the Netherlands, Amsterdam, is transcribed as:
- /'A m - s t @ r - d *A m/ in Dutch
- /*E m - s t @ r - d 'E m/ in English
- /*a m - s t @ r - d a m/ in Italian
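In software, this language dependency usually means that the language is part of the lookup key. The following Python sketch shows the idea with a tiny, hypothetical lexicon; the name `LEXICON`, the function `transcribe` and the language codes are assumptions for illustration, and the transcriptions are copied verbatim from the list above.

```python
# Hypothetical mini-lexicon: (language code, lower-cased word) -> transcription
LEXICON = {
    ("nl", "amsterdam"): "'A m - s t @ r - d *A m",
    ("en", "amsterdam"): "*E m - s t @ r - d 'E m",
    ("it", "amsterdam"): "*a m - s t @ r - d a m",
}


def transcribe(word: str, language: str) -> str:
    """Return the transcription of `word` for the given language."""
    key = (language, word.lower())
    if key not in LEXICON:
        raise KeyError(f"no entry for {word!r} in language {language!r}")
    return LEXICON[key]


for lang in ("nl", "en", "it"):
    print(lang, transcribe("Amsterdam", lang))
```

A real G2P system would fall back to language-specific rules or a trained model when a word is missing from the lexicon; here a missing entry simply raises an error.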
Problems
Several issues arise when aligning orthographic forms with transcriptions. The main ones are abbreviations and acronyms, proper names, foreign words and numbers: the written form may differ greatly from how the majority of native speakers say it.
Abbreviations
The easy cases are abbreviations without a vowel: they are either spelled out or replaced by the word they stand for.
- NCRV → /E n - s e - E r - v e/ (a Dutch broadcast organisation)
- Mr → Mister → /m I s - t @ r/
If an abbreviation contains one or more vowels, whether it is spelled out or read as a word depends on local convention.
- NOS → N. O. S. → /E n - o - E s/ (the national Dutch broadcast organisation). NOS could be pronounced as a word, /n O s/, but nobody does so.
- RAI → /r Ai/ (the national Italian broadcast organisation). The word is perfectly pronounceable in Italian, so nobody spells it out.
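The decision between spelling out and reading as a word can be sketched as a vowel check combined with a per-language list of conventions. The Python code below is only an illustration: the names `is_spelled_out`, `spell_out` and `SPELLED_BY_CONVENTION` are assumptions, and the Dutch letter names are limited to the letters that occur in the NCRV and NOS examples above.

```python
VOWELS = set("AEIOU")

# Dutch letter names, taken from the NCRV and NOS examples only.
DUTCH_LETTER_NAMES = {
    "N": "E n", "C": "s e", "R": "E r", "V": "v e", "O": "o", "S": "E s",
}

# Abbreviations that contain a vowel but are spelled out by convention.
SPELLED_BY_CONVENTION = {"NOS"}


def is_spelled_out(abbreviation: str) -> bool:
    """Spell out if there is no vowel, or if convention says so."""
    letters = abbreviation.upper()
    has_vowel = any(ch in VOWELS for ch in letters)
    return not has_vowel or letters in SPELLED_BY_CONVENTION


def spell_out(abbreviation: str) -> str:
    """Join the letter names with syllable boundaries."""
    return " - ".join(DUTCH_LETTER_NAMES[ch] for ch in abbreviation.upper())


print(spell_out("NCRV"))      # E n - s e - E r - v e
print(spell_out("NOS"))       # E n - o - E s
print(is_spelled_out("RAI"))  # False
```

The convention list has to be maintained per language: an Italian system would keep its own list, in which RAI does not appear. Abbreviations such as "Mr", which are replaced by a full word rather than spelled out, would need a separate expansion table.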
A particular kind of abbreviation is one whose expansion depends on the context.
- An appointment with dr. Corti → an appointment with doctor Corti
- An appointment on the Corti dr. → an appointment on the Corti drive
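A very simple way to resolve such cases is to look at the neighbouring words. The Python sketch below uses a naive heuristic that is an assumption of this example, not an established rule: "dr." followed by a capitalised word is read as a title, and "dr." preceded by a capitalised word is read as a street type. The function name `expand_dr` is hypothetical.

```python
def expand_dr(sentence: str) -> str:
    """Expand "dr." to "doctor" or "drive" based on neighbouring words."""
    tokens = sentence.split()
    result = []
    for i, token in enumerate(tokens):
        if token.lower() != "dr.":
            result.append(token)
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        preceding = tokens[i - 1] if i > 0 else ""
        if following[:1].isupper():
            result.append("doctor")   # title before a name
        elif preceding[:1].isupper():
            result.append("drive")    # street type after a name
        else:
            result.append(token)      # undecided: leave unchanged
    return " ".join(result)


print(expand_dr("An appointment with dr. Corti"))
print(expand_dr("An appointment on the Corti dr."))
```

With the two example sentences above, this yields "An appointment with doctor Corti" and "An appointment on the Corti drive". Real text needs more robust disambiguation, for instance when "dr." stands between two capitalised words.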
Proper names
Proper names, including geographic names, are often written according to historic standards of a language.
In German, for example, an "i" or "e" in a name sometimes originally served to lengthen the preceding vowel: "Voigt" → /f o: g t/, or "Itzehoe" → /I ts @ h o:/, but not so in "Michael" → /m I C a e: l/.
Numbers
Numbers are special, too. A plain number such as 19 is generally spoken as "nineteen" → /n Ai n - t i n/. But numbers are context-sensitive as well:
- As a year, 1970 is generally spoken as "nineteen seventy" (or "nineteen hundred seventy"), but in "1970 €" it is "one thousand nine hundred and seventy"; by contrast, the year 2020 and "2020 €" can both be spoken as "two thousand and twenty".
- In English, "zero", "oh" and "naught" are all words for 0; in German, 2 can be spoken as "zwei" or "zwo".
- My phone number is 621888146 → 6 2 1 8 8 8 (or 3 times 8) 1 4 6
- The cost in euros of that bridge is 621888146 → six hundred twenty-one million etc.
So, before applying G2P, one needs to preprocess (or normalise) the given text to obtain its most likely pronunciation.
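For the two readings of 621888146 mentioned above, such a normalisation step might look like the following Python sketch. It assumes that the context (phone number or amount) is already known; detecting it is a separate problem. The function names are hypothetical, the cardinal reading omits "and", and only values below one billion are supported. Years would need a rule of their own.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def digit_reading(digits: str) -> str:
    """Read a digit string one digit at a time, e.g. a phone number."""
    return " ".join(ONES[int(d)] for d in digits)


def below_thousand(n: int) -> str:
    """Spell out 1..999 in English words."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + ONES[n % 10]
        parts.append(word)
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)


def cardinal_reading(n: int) -> str:
    """Spell out 0..999,999,999 as a cardinal number."""
    if n == 0:
        return "zero"
    parts = []
    for value, name in ((1_000_000, "million"), (1_000, "thousand")):
        if n >= value:
            parts.append(below_thousand(n // value) + " " + name)
            n %= value
    if n:
        parts.append(below_thousand(n))
    return " ".join(parts)


print(digit_reading("621888146"))
print(cardinal_reading(621888146))
```

The first call gives "six two one eight eight eight one four six"; the second gives "six hundred twenty-one million eight hundred eighty-eight thousand one hundred forty-six". The resulting word sequence is what is then passed on to G2P.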