Loquendo
On mixed-language capabilities for Text-To-Speech systems

Author: Paolo Baggia, Loquendo
Date: July 15, 2004.

It is well known that the quality of Text-To-Speech (TTS) rendering has greatly improved in recent years. The new generation of speech synthesis engines is based on a large database of recorded speech; this technique is called Unit Selection, and we were among the first companies to deploy it in a TTS product. (See [TTS] for more details on the Unit Selection speech synthesis technique.) The main goal of the previous generation of TTS systems was intelligibility, but the speech they produced still sounded unnatural, robotic, and therefore unfriendly, even when it was almost perfectly intelligible. (If you want to hear it, please choose the Italian Mario voice here and test it yourself!)

The quality today is good, but it can obviously be improved further. One TTS challenge, analyzed in this paper, is how to render a text that mixes more than one language. We call this mixed-language capability, and it is a new feature recently added to Loquendo TTS (version 6.2). Is this feature useful? I'd say yes: think of texts coming from different sources in unpredictable languages (e.g. the Internet), e-mails that may be a patchwork of attached pieces written in a mixture of languages, advertising slogans, movie titles, reported news, proper names, song titles, etc.

How can mixed-language text be handled? If you simply try to render a mixed-language text with a monolingual synthesis engine, the result will be funny at best and almost incomprehensible at worst. The rules of phonology, coarticulation, morphology, and syntax will clash and produce an awful result, of no use in a real-life speech application.

The first idea is to choose a bilingual, or even multilingual, voice talent (a polyglot speaker, let us say) and record her or him speaking in different languages. Fine, but there is still a technical issue: which language is the text to be rendered in? You need a language guesser, that is, software that can guess the language of a piece of text. Loquendo has implemented one! (Try here.) The Loquendo Language Guesser is trained on a large amount of text in different languages to improve its discriminative power. It covers all the languages deployed by Loquendo TTS. Prediction accuracy is higher for longer portions of text and can be improved by reducing the number of candidate languages.
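
As a rough illustration of how a language guesser can work, the following Python sketch scores a text against character-trigram profiles. It is not Loquendo's implementation: the function names, the tiny training samples, and the overlap score are all assumptions made for this example, and a real guesser would be trained on far more text.

```python
from collections import Counter

def trigrams(text):
    """Return a Counter of character trigrams, padded with spaces."""
    padded = "  " + text.lower() + "  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# Hypothetical, tiny training samples; a real guesser needs large corpora.
TRAINING = {
    "English": "hello can i help you the weather is good today and this is the end",
    "French": "bonjour madame je peux vous aider le temps est beau aujourd'hui et voici la fin",
}
PROFILES = {lang: trigrams(sample) for lang, sample in TRAINING.items()}

def guess_language(text, candidates=None):
    """Pick the candidate language whose trigram profile overlaps most with the text."""
    candidates = candidates or list(PROFILES)
    grams = trigrams(text)

    def score(lang):
        profile = PROFILES[lang]
        total = sum(profile.values())
        # Weight each trigram of the text by its relative frequency in the profile.
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(candidates, key=score)
```

The optional candidates argument restricts the set of languages to choose from, which mirrors the remark above that fewer candidate languages make the guess easier.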

Let us start with an example. If the sentence to be spoken is "Hello Mme. Françoise Dupont can I help you?", the Language Guesser has to identify that "Mme. Françoise Dupont" is to be spoken in French. The output of the Language Guesser may then be the following: "Hello \lang=French Mme. Françoise Dupont \lang can I help you?". This is also the format you can use to direct Loquendo TTS if you want to tag different languages yourself.
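
To make the tagged format concrete, here is a minimal Python sketch that splits such a string into language segments. It assumes, based only on the example above, that \lang=Name switches to the named language and that a bare \lang switches back to the default one; the function name and the default language are hypothetical, and the full tag syntax of Loquendo TTS may differ.

```python
import re

# Matches "\lang=Name" (switch to Name) or a bare "\lang" (back to the default).
TAG = re.compile(r"\\lang(?:=(\w+))?")

def split_by_language(tagged, default="English"):
    """Split tagged text into a list of (language, text) segments."""
    segments = []
    language = default
    position = 0
    for match in TAG.finditer(tagged):
        chunk = tagged[position:match.start()].strip()
        if chunk:
            segments.append((language, chunk))
        # A tag without "=Name" restores the default language (an assumption).
        language = match.group(1) or default
        position = match.end()
    tail = tagged[position:].strip()
    if tail:
        segments.append((language, tail))
    return segments
```

Each segment can then be handed to the text processing of its own language.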

It is worth noting that selecting a language with the Language Guesser lets the synthesis engine switch the language knowledge that TTS requires. The first step is to apply language-dependent text processing: in the previous example, "Mme." has to be expanded into the French word "madame" (corresponding to "Mrs" or "madam" in English). Then the word has to be transformed into a string of French phonemes, so "madame" becomes /mɑdˈɑm/, as it is spoken in French, instead of /ˈmædəm/, as it is spoken in English. Phonemes are the units that compose a spoken language; in the examples they are represented in a simplified form. In fact, different languages have different phoneme sets.
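
The two language-dependent steps just described, abbreviation expansion and phonetic transcription, can be sketched in Python as two lookups keyed by language. The tables and the function name are hypothetical and hold only the entries from the example; a real engine relies on full lexica and letter-to-sound rules.

```python
# Hypothetical per-language resources, limited to the words used in the example.
ABBREVIATIONS = {
    "French": {"Mme.": "madame"},
    "English": {},
}
LEXICON = {
    "French": {"madame": "mɑdˈɑm"},
    "English": {"madam": "ˈmædəm"},
}

def transcribe(word, language):
    """Expand an abbreviation if needed, then look up the phoneme string."""
    expanded = ABBREVIATIONS.get(language, {}).get(word, word).lower()
    phonemes = LEXICON.get(language, {}).get(expanded)
    if phonemes is None:
        raise KeyError(f"no {language} transcription for {expanded!r}")
    return phonemes
```

The same written form can therefore yield different phoneme strings depending on the language selected for it.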

Is the mixed-language issue solved? Not at all. Even if the Language Guesser helps to select the correct transcriber in the speech engine, it is very difficult to find a good bilingual speaker. Finding a speaker of five or seven languages is a mission impossible! In practice we have successfully found speakers of both Castilian and Catalan: Carmen, our Castilian Spanish female voice, and Montserrat, our Catalan female voice, are recorded by the same speaker, and so are the male voices Jordi (Catalan) and Jorge (Castilian Spanish). This is the exception that proves the rule!

Another option is to use the Language Guesser to select the language and then switch to a different voice for the guessed language. This is possible, and in some contexts it is a good solution, but in general frequent voice changes are very annoying. In the example above, the voice would have to change in the middle of a sentence, which is not very pleasant.

The final idea was to map the phonemes of one language onto those of another. We call this technique Phoneme Mapping; see [MixLang] for a detailed description. Let us try to describe it. Given a text written in language lang1 that includes a piece written in language lang2, we first identify the piece written in lang2 by using the Language Guesser, then transform it into lang2 phonemes, as a native speaker would read it, and finally map that phoneme string into the phoneme set of the container language lang1.

Let us look at the previous example:

Hello \lang=French Mme. Françoise Dupont \lang can I help you?

The first step is to transcribe each piece of text with a different transcriber, according to the tagging done by the Language Guesser. The result is the following, where the second, third, and fourth transcriptions are the French words:

/həlˈə͡ʊ/ /mɑdˈɑm/ /fχãswˈɑz/ /dypˈõ/ /kæn/ /a͡ɪ/ /hˈelpʰ/ /ju/

The next step is to map the French phonemes into English ones. The result is the following:

/həlˈə͡ʊ/ /mɑːdˈɑːm/ /fɹɑːnswˈɑːz/ /dupʰˈɔːn/ /kən/ /a͡ɪ/ /hˈelp̚/ /ju/

The last step is a re-processing of the English transcription to perform an allophonic substitution.
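
The mapping step in this example can be sketched in Python as a lookup table from French phonemes to their nearest English ones. The table below is hypothetical: it was read off the two transcriptions above and covers only the phonemes that change in this example, whereas the real mapping is described in [MixLang].

```python
import unicodedata

# Hypothetical French-to-English table, read off the example transcriptions.
# Keys are single NFC code points; "\u00e3" is a with tilde, "\u00f5" is o with tilde.
FRENCH_TO_ENGLISH = {
    "ɑ": "ɑː",
    "χ": "ɹ",
    "\u00e3": "ɑːn",
    "y": "u",
    "\u00f5": "ɔːn",
}

def map_phonemes(phonemes, table=FRENCH_TO_ENGLISH):
    """Replace each source phoneme with its counterpart in the target language."""
    # Normalize first so that a vowel plus combining tilde matches the table keys.
    normalized = unicodedata.normalize("NFC", phonemes)
    # Symbols without an entry (shared phonemes, stress marks) pass through unchanged.
    return "".join(table.get(symbol, symbol) for symbol in normalized)
```

Applied to the French transcriptions of "madame" and "Françoise", this reproduces the mapped forms shown above. It does not add the aspiration that appears on /pʰ/ in the mapped form of "Dupont".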

Pros and cons: the whole text is rendered by the same voice, so there is no need to change voice in the middle of a sentence to accommodate the inserted language. Phoneme Mapping yields an approximate pronunciation: it sounds like a well-trained English speaker reading French. At first sight this approximation seems to be a drawback; however, it is actually quite a good result, because inserting a natively pronounced French piece into an English text runs into a range of difficult problems.

In fact, a speaker who has to pronounce foreign words embedded in a text written predominantly in her or his own language will tend to pronounce them in a way that may differ, even significantly, from the correct pronunciation of the same words in a complete text in the foreign language; for instance, coarticulation differs between the two languages. This approximation is mainly due to the speaker's choice to keep the phonological system of the native tongue, but also to coarticulation, economy of effort, and psychosocial factors. Even a trained speaker will make the same approximation to avoid the worst tongue contortions.

Now it is time to stop reading and do some exercises instead: you can try some mixed-language text here and judge for yourself. We know that there is room for further improvement, but the method described above is efficient, language-independent, and entirely phonetics-based, and it enables any Loquendo TTS voice to speak all the languages provided by the system. It is a good starting point for use in real applications.

We are at the end of this story, though there is much more to tell, for instance about the new Audio Mixer feature of the TTS. Perhaps we will discuss this in a future issue; for the moment, you can read more on the website, here.

----------------------------------------------------------------------------------------------------------------------------------------------------
References:

If you are interested in going more deeply into how a speech synthesis engine is implemented or finding out more about mixed-language capabilities, you are invited to read some additional papers, such as the following:

[TTS]
Silvia Quazza, Laura Donetti, Loreta Moisa, Pier Luigi Salza, "ACTOR®: A Multilingual Unit-Selection Speech Synthesis System", Proc. of 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Atholl, Scotland, 2001.
[MixLang]
Leonardo Badino, Claudia Barolo, Silvia Quazza, "A General Approach to TTS Reading of Mixed-Language Texts", Proc. of 5th ISCA Tutorial and Research Workshop on Speech Synthesis, Pittsburgh, PA, 2004.
 
----------------------------------------------------------------------------------------------------------------------------------------------------