Loquendo
SSML 1.0: an XML-based language aimed at improving TTS rendering

Authors: Davide Bonardo, Paolo Baggia (Loquendo)
Date: September 09, 2004.

The Speech Synthesis Markup Language (SSML), the new standard way of producing content to be spoken by a speech synthesis system, is now a W3C Recommendation (see the W3C press release and testimonials issued on September 8th 2004 or the full specification at www.w3.org/TR/speech-synthesis/). SSML is an XML-based markup language designed to control a Text-To-Speech (TTS) engine in a variety of application contexts.

The practical goal of a TTS engine is to generate spoken output from plain text or from XML-based documents such as SSML. TTS is useful not only for helping people with visual impairments to read texts and interact with computers, but also for many applications that employ voice-output technologies, for example: alarms or announcements (e.g. in a railway station), interactive telephone services, information services (weather forecasts, traffic conditions, news...), banking, e-mail or fax reading, games, etc. In all these very different applications, the TTS engine generates the spoken output corresponding to the message to be communicated. In many cases TTS can replace the recording of prompts by a professional speaker, a very costly activity. Moreover, a TTS engine can be controlled at a fine-grained level, changing some parameters to improve and optimize the acoustic rendering. SSML helps to specify in detail how to utter a text, a paragraph, a sentence, a phrase or a single word.

SSML compliance is becoming a must for TTS engines and voice platforms, because the SSML specification is part of a larger set of markup specifications for voice browsers, called the W3C Speech Interface Framework, which has been finalized by the Voice Browser Working Group (www.w3.org/voice). Even though SSML is a "new" markup language, it is already crucial in many contexts. The Voice Extensible Markup Language (VoiceXML 2.0), which enables Web-based development of speech applications, includes and extends SSML for prompt messages. In multimodal applications, SSML is mandatory both for Speech Application Language Tags (SALT) prompts and for the XHTML+Voice (X+V 1.2) language, which reuses a modularized VoiceXML embedded into XHTML visual pages. SSML is also prescribed for prompts in the Synchronized Multimedia Integration Language (SMIL) and in the Media Resource Control Protocol (MRCP).

Finally, Cascading Style Sheets (CSS), employed to augment standard visual documents (like HTML), are being extended with a new CSS3 Speech Module, which is designed to produce SSML content. With the CSS3 Speech Module, it will be very easy to listen to a Web page instead of reading it.

Inside a TTS engine
A look at the basics of TTS technology helps to understand how SSML reading instructions affect the behavior of a TTS engine. While different TTS systems may adopt different algorithms to obtain the correct output audio signal automatically, a generic synthesis process can be described as shown in Figure 1.


Figure 1 - A generic TTS system

The Text Analysis step converts the input text into a more detailed and precise description of what should be pronounced. The exact specification of the sounds (phonemes) to be synthesized and of their intonation requires analyzing the text, resolving its ambiguities, and interpreting punctuation, symbols, acronyms, etc. Once what should be pronounced has been specified, the Speech Synthesis step generates the actual speech signal. Different techniques may be applied. The most effective in yielding high-quality and natural-sounding speech is the Unit Concatenation synthesis technique, which relies on a database of human speech samples from which the most suitable segments are extracted and combined to match the input text [TTS].

SSML makes it possible to direct the Text Analysis step as well as to specify some features of the voice used by the Speech Synthesis step. It is useful to look more closely at the Text Analysis step, where SSML may play a major role. To obtain a precise description of the pronunciation of the input text, the TTS engine should perform the following tasks:

  1. Text normalization
    The input text should be segmented into sentences and words, relying on blanks and punctuation marks. Numbers, special symbols, acronyms and abbreviations should be expanded appropriately to normalize the text into a standard format, consisting of graphemes (the words written in the text) and punctuation marks.

    For example, the input text to be spoken: "Dr. John Smith lives at Jefferson dr. 94, Paris, Texas." will be normalized into: "Doctor John Smith lives at Jefferson drive ninety-four {pause} Paris {pause} Texas {pause}". Currencies will also be expanded; if the input text is: "The winner will receive $1000 cash!" it will be normalized into: "The winner will receive one thousand dollars cash! {pause}".

  2. Word pronunciation (lexical stress, phonetic transcription)
    Each word must be assigned its lexical stress and should be converted into a phonetic representation. Rules and lexicons are designed for this purpose. The letter-to-sound (grapheme-to-phoneme) correspondence is relatively direct for some languages, while it can be highly unpredictable for others, like English. In these cases word pronunciation can be considered lexicon-dependent.

    An example of a grapheme-to-phoneme transformation is the following: "The page cannot be displayed!" will be transcribed as a sequence of phonemes: /ðə/ /pˈeɪʤ/ /kˈænɑːt̚/ /bˈiː/ /dɪsplˈeɪd/ (the phonetic alphabet used in this example is IPA, the International Phonetic Alphabet defined by the International Phonetic Association [IPA]).

    In addition to the rules and lexical knowledge implemented in the TTS engine, user-defined exceptions can usually be added to improve the TTS pronunciation in the application domain. It is possible to put a phonetic transcription directly in the text or to use a lexicon file containing all the transcriptions, acronyms and abbreviations (see step 1 above).

  3. Sentence structure
    Intonation and rhythm, and in some cases also stress and phonetic transcription, depend on the role of words in the sentence, i.e. on sentence structure. What is reflected in prosody (prominence, phrase separation, pauses, intonation) is the meaning of the sentence, but for synthesis systems the bridge between text and prosody is generally syntax. Syntax-based prosodic rules are usually implemented in TTS engines in order to determine the position of breath pauses, the relative prominence of words and the most suitable intonation.

  4. Sentence-level modifications in phonetic transcription
    Phonetically transcribed words should be concatenated in the sentence. This requires some modifications in the phonetic transcription, accounting for changes due to continuous speech, also called co-articulation phenomena (phoneme change, reductions, de-accentuation).

  5. Computation of prosodic parameters
    The analysis of the text in terms of a sequence of phonemes is completed when their prosodic description is given. Based on the general prosodic structure previously determined (step 3. above), the target intonation and rhythm are assigned to the phoneme sequence and expressed either with numerical values or with prosodic labels, depending on the adopted synthesis technique. The assignment of prosodic values can rely on explicit rules or be performed by an automatic learning system.
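
The text normalization task (step 1. above) can be illustrated with a minimal Python sketch. It is not how any particular engine works: the tiny abbreviation table, the helper names and the "{pause}" marker convention are assumptions made only to reproduce the "Dr. John Smith" example, and a real engine would use context rather than letter case to tell "Doctor" from "drive".

```python
import re

# Hypothetical, deliberately tiny tables. A real engine relies on large
# lexicons and on context to choose between readings of the same token.
ABBREVIATIONS = {"Dr.": "Doctor", "dr.": "drive"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
PUNCTUATION = ",.;:!?"


def number_to_words(n):
    """Spell out an integer in the range 0..9999."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return ONES[thousands] + " thousand" + tail


def normalize(text):
    """Expand abbreviations, numbers and dollar amounts; mark pauses."""
    # Currency first, so that "$1000" becomes "one thousand dollars".
    text = re.sub(r"\$(\d+)",
                  lambda m: number_to_words(int(m.group(1))) + " dollars",
                  text)
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
            continue
        word = token.rstrip(PUNCTUATION)
        punctuated = word != token
        if word.isdecimal():
            word = number_to_words(int(word))
        if word:
            out.append(word)
        if punctuated:
            # Trailing punctuation is turned into a pause marker.
            out.append("{pause}")
    return " ".join(out)
```

Calling normalize("Dr. John Smith lives at Jefferson dr. 94, Paris, Texas.") returns the normalized string given in step 1. The sketch drops the punctuation character itself and keeps only the pause marker.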

SSML directs all Text Analysis steps, providing a standard way to control aspects of speech such as pronunciation, acronym expansion, volume, pitch, rate, range, duration, pause, emphasis, etc., across different synthesis-capable platforms.


Basic description of SSML language
The following simple and explicit example may be useful as a reference for understanding the SSML syntax.

Example 1

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
   <voice name="Dave">
	Hello, world; my name is Dave.
   </voice>
</speak>

In this example the voice named "Dave" should pronounce the following sentence: "Hello, world; my name is Dave."

Like other XML-based markup languages, SSML is composed of elements. Elements can have attributes, which are briefly described below.
The root element of the SSML language is <speak>; it contains the SSML text to be spoken. The <speak> element has two required attributes: xml:lang (which specifies the language to be spoken) and version (which indicates the version of the specification, currently "1.0"). Besides these two attributes, a few others are available, such as: xmlns, which declares the SSML namespace; xsi:schemaLocation, which indicates the location of the XML Schema for SSML; and xml:base, which specifies the base URI of the root document.
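
As an illustration of the two required attributes, the following Python sketch parses a document with the standard xml.etree.ElementTree module and reports what is missing on the root element. The function name check_root and the wording of the messages are assumptions made for this example; it is not a substitute for validation against the SSML schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI that ElementTree uses for the reserved "xml:" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def check_root(ssml_text):
    """Return a list of problems found on the root element (empty if none)."""
    root = ET.fromstring(ssml_text)
    problems = []
    # Strip a possible "{namespace}" prefix from the tag name.
    if root.tag.rsplit("}", 1)[-1] != "speak":
        problems.append("root element is not speak")
    if root.get("version") != "1.0":
        problems.append("missing or unexpected version attribute")
    if not root.get(f"{{{XML_NS}}}lang"):
        problems.append("missing required attribute xml:lang")
    return problems
```

For the body of Example 1 the function returns an empty list; if xml:lang is removed from <speak>, it reports the missing attribute.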

The following table shows which SSML elements are associated with the five Text Analysis tasks described above; the individual elements are described below.


Table 1 - SSML elements for controlling Text Analysis step

All other elements of the SSML language, and the text to be spoken, are contained within the root <speak> element; the <meta>, <metadata> and <lexicon> elements must occur immediately after the opening of the root element, before any other content. The <meta> and <metadata> elements help to annotate the document (e.g. author, date, etc.).

This is an example specifying a lexicon which helps to transform the abbreviations commonly used in SMS text messages into a normalized and speakable form.

Example 2

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
	<lexicon uri="file://localhost/share/SMSLexicon.lex"/>
	   <!-- This is the text of a SMS to be read using a
	        lexicon file. -->
	   I love U :-)
</speak>
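
The rule that <meta>, <metadata> and <lexicon> must come before any other content of <speak> can be checked mechanically. The Python sketch below is one possible way to do it with xml.etree.ElementTree; the function name is an assumption, and the check covers only this ordering rule, not the rest of the specification.

```python
import xml.etree.ElementTree as ET

HEAD_ELEMENTS = {"meta", "metadata", "lexicon"}


def head_elements_come_first(ssml_text):
    """True if no text or other element precedes meta/metadata/lexicon."""
    speak = ET.fromstring(ssml_text)
    # Text placed directly after the <speak> start tag counts as content.
    content_seen = bool(speak.text and speak.text.strip())
    for child in speak:
        local_name = child.tag.rsplit("}", 1)[-1]
        if local_name in HEAD_ELEMENTS:
            if content_seen:
                return False
        else:
            content_seen = True
        # Text following a child element is stored in its "tail".
        if child.tail and child.tail.strip():
            content_seen = True
    return True
```

Example 2 passes this check because the <lexicon> element precedes the SMS text. XML comments do not count as content here, since the default parser discards them.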

The main elements for expressing the text structure (step 3. above) are <p>, representing a paragraph, and <s>, representing a sentence. Another element is <phoneme>, which controls word pronunciation (step 2. above) by providing, in its ph attribute, a phonemic/phonetic pronunciation for the contained text. Different phonemic/phonetic alphabets can be chosen with the alphabet attribute ("ipa" for the International Phonetic Alphabet [IPA]). The IPA alphabet is mandatory, but individual engines can support other pronunciation alphabets (e.g. "x-loquendo" for the Loquendo alphabet).

Example 3

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
	Typical Italian dishes are
	<phoneme alphabet="ipa" ph="&#112;&#712;&#105;&#720;&#116;&#794;&#115;&#601;">
	  pizza
	</phoneme>
	<!-- "pizza" will be pronounced as: pˈiːt̚sə -->
	and
	<phoneme alphabet="ipa" ph="&#115;&#112;&#601;&#103;&#712;&#101;&#116;&#812;&#105;">
	  spaghetti.
	</phoneme>
	<!-- "spaghetti" will be pronounced as: spəgˈet̬i -->
</speak>
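
The ph attribute values in Example 3 are IPA characters written as decimal XML character references. A small Python sketch shows how such a value can be produced from an IPA string; the helper names to_char_refs and phoneme are assumptions made for this illustration.

```python
from xml.sax.saxutils import escape


def to_char_refs(ipa):
    """Encode every character as a decimal XML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in ipa)


def phoneme(text, ipa):
    """Wrap text in a <phoneme> element carrying its IPA transcription."""
    return (f'<phoneme alphabet="ipa" ph="{to_char_refs(ipa)}">'
            f"{escape(text)}</phoneme>")


# The transcription of "pizza" from Example 3, written with escapes so
# that the combining diacritic (U+031A) is visible in the source.
PIZZA_IPA = "p\u02c8i\u02d0t\u031as\u0259"
```

With these definitions, phoneme("pizza", PIZZA_IPA) yields the first <phoneme> element of Example 3 on a single line. Writing the IPA characters directly in a UTF-8 encoded document is equally valid XML; the references are only a way to keep the file pure ASCII.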

To expand an abbreviation (step 1. above), the <sub> element is used. It indicates that, for pronunciation, the contained text is replaced by the value of the alias attribute.

Step 1 of Text Analysis showed how a TTS system works automatically. As the knowledge contained in lexicon files (for abbreviations, acronyms, special symbols, etc.) might not be enough, it is possible to use the <sub> element to spell out the expansion of abbreviations. Suppose a TTS engine is used to synthesize the phrase: "The SSML v. 1.0 is a new standard of W3C"; the TTS output may be: "The SSML v. one point o is a new standard of W3C". With SSML the synthesis improves:

Example 4

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
	The <sub alias="Speech Synthesis Markup Language">SSML</sub>
	<sub alias="version">v.</sub> 1.0 is a new standard of
	<sub alias="World Wide Web Consortium">W3C</sub>.
</speak>

All the acronyms and abbreviations have been expanded.
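
Markup like that of Example 4 can also be generated from a table of known expansions. The Python sketch below is a naive illustration: the function name and the dictionary are assumptions, and the plain substring matching would also hit an abbreviation inside a longer word, which a real application would have to guard against.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def add_sub_elements(text, expansions):
    """Wrap every known abbreviation in a <sub alias="..."> element."""
    # Longest keys first, so that a long abbreviation wins over a prefix.
    keys = sorted(expansions, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(k) for k in keys) + ")")
    out = []
    # With one capturing group, re.split alternates plain text (even
    # positions) and matched abbreviations (odd positions).
    for i, part in enumerate(pattern.split(text)):
        if i % 2 == 1:
            alias = quoteattr(expansions[part])
            out.append(f"<sub alias={alias}>{escape(part)}</sub>")
        else:
            out.append(escape(part))
    return "".join(out)


EXPANSIONS = {
    "SSML": "Speech Synthesis Markup Language",
    "v.": "version",
    "W3C": "World Wide Web Consortium",
}
```

Applied to "The SSML v. 1.0 is a new standard of W3C." with the table above, the function produces the content of the <speak> element of Example 4 on one line.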

To control prosody and style (step 5. above) one may use the <voice> element and its attributes (xml:lang, gender, variant, age, and name) to request the most suitable voice, or the <emphasis> and <break> elements to increase or decrease emphasis in a sentence and to insert or remove pauses, respectively.
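
The attribute values accepted by <emphasis> and <break> in SSML 1.0 are small closed sets, which the following Python sketch encodes. The helper names emphasis and pause are assumptions made for this illustration, and only a duration in milliseconds is supported for the time attribute.

```python
from xml.sax.saxutils import escape

# Values defined by SSML 1.0 for the level and strength attributes.
EMPHASIS_LEVELS = {"strong", "moderate", "none", "reduced"}
BREAK_STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}


def emphasis(text, level="moderate"):
    """Wrap text in <emphasis>; "reduced" lowers the emphasis."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def pause(strength=None, time_ms=None):
    """Build a <break/>; strength="none" asks for no pause at all."""
    attrs = ""
    if strength is not None:
        if strength not in BREAK_STRENGTHS:
            raise ValueError(f"unknown break strength: {strength}")
        attrs += f' strength="{strength}"'
    if time_ms is not None:
        attrs += f' time="{int(time_ms)}ms"'
    return f"<break{attrs}/>"
```

For instance, pause(time_ms=500) yields a half-second pause, while pause(strength="none") removes a pause the engine would otherwise insert.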

The following is an extract from Hamlet, Act I, Scene 1. The dialog is rendered by changing voices for the different characters on the basis of the variant attribute: "1" is the narrator; "2" is Bernardo, an officer; "3" is Francisco, a soldier.

The example includes some <emphasis> and <break> elements.

Example 5

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en">

	<voice gender="male" variant="1">
	  Elsinore. A platform before the Castle.
	</voice>

	<voice gender="male" variant="2">
	  <emphasis>Who's there?</emphasis>
	</voice>

	<voice gender="male" variant="3">
	  Nay, answer me: stand, and unfold yourself.
	</voice>

	<voice gender="male" variant="2">
	  <emphasis level="strong"> Long live the king! </emphasis>
	</voice>

	<voice gender="male" variant="3">
	  Bernardo?
	</voice>

	<voice gender="male" variant="2">
	  He.
	</voice>

	<voice gender="male" variant="3">
	  You come most carefully upon your hour.
	</voice>

	<voice gender="male" variant="2">
	  'Tis now struck twelve. Get thee to bed, Francisco.
	</voice>

	<voice gender="male" variant="3">
	  For this relief much thanks: 'tis bitter cold,
	  And I am sick at heart.
	</voice>

	<voice gender="male" variant="2">
	  Have you had quiet guard?
	</voice>

	<voice gender="male" variant="3">
	  Not a mouse stirring.
	</voice>

	<voice gender="male" variant="2">
	  Well, good night.
	  <break/> If you do meet Horatio and Marcellus,
	  The rivals of my watch, bid them make haste.
	</voice>

	<voice gender="male" variant="3">
	  I think I hear them. Stand, ho! Who is there?
	</voice>
</speak>

Of course this is a joke: current TTS is not suitable for interpreting plays. Nevertheless, some dialogs may be rendered effectively.
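
A document shaped like Example 5 is repetitive enough to be generated from a list of dialog turns. The Python sketch below builds it with xml.etree.ElementTree; the function name and the (variant, line) tuple format are assumptions made for this illustration, and every speaker is given gender="male" as in the example.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def dialogue_to_ssml(turns, lang="en"):
    """Build an SSML string with one <voice> element per dialog turn."""
    speak = ET.Element("speak", {"version": "1.0", XML_LANG: lang})
    for variant, line in turns:
        voice = ET.SubElement(
            speak, "voice", {"gender": "male", "variant": str(variant)})
        voice.text = line
    return ET.tostring(speak, encoding="unicode")


TURNS = [
    (1, "Elsinore. A platform before the Castle."),
    (2, "Who's there?"),
    (3, "Nay, answer me: stand, and unfold yourself."),
]
```

Calling dialogue_to_ssml(TURNS) returns the first three turns of the extract as a single <speak> document; special characters in the lines are escaped by the serializer.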

With its attributes for pitch, contour, range, rate, duration, and volume, the <prosody> element allows closer control of intonation. The next example shows how to use the <prosody> element and its attributes.

Example 6

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-GB">
	Using the prosody element it is possible to
	<prosody rate="x-fast">
	  read a sentence fast
	</prosody>
	or to
	<prosody pitch="+60Hz">
	  manipulate the voice pitch
	</prosody>
	and other important parameters.
</speak>
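
A value such as pitch="+60Hz" is a relative change with respect to the current pitch. SSML 1.0 allows relative changes in Hz, in semitones (st) or as a percentage. The Python sketch below shows the arithmetic under the assumption that the engine exposes a baseline pitch in Hz; the function name and the baseline figures are hypothetical, and how an engine actually applies the change is implementation-dependent.

```python
import re


def apply_pitch_change(base_hz, change):
    """Apply a relative pitch change such as "+60Hz", "-2st" or "+10%"."""
    match = re.fullmatch(r"([+-]\d+(?:\.\d+)?)(Hz|st|%)", change)
    if match is None:
        raise ValueError(f"not a relative pitch change: {change}")
    amount = float(match.group(1))
    unit = match.group(2)
    if unit == "Hz":
        # Absolute offset in Hertz.
        return base_hz + amount
    if unit == "%":
        # Percentage of the current value.
        return base_hz * (1 + amount / 100)
    # Semitones: twelve semitones double the frequency.
    return base_hz * 2 ** (amount / 12)
```

With a hypothetical baseline of 120 Hz, the "+60Hz" of Example 6 gives a target of 180 Hz, and "+12st" applied to 220 Hz gives 440 Hz, one octave higher.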


SSML also provides controls for the Speech Synthesis step (see Figure 1). The <voice> element takes effect in this step, making it possible to choose a specific voice; the <audio> element inserts an audio file into the synthesized speech, and the <mark> element inserts a bookmark that can be used for control and synchronization at the application level.
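
An application that wants to synchronize with the speech output needs to know which bookmarks and audio files a document contains. The Python sketch below extracts the name attribute of each <mark> and the src attribute of each <audio> in document order; the function names, the sample document and its example.com URL are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def _attributes_of(ssml_text, element_name, attribute):
    """Collect one attribute from every element with the given local name."""
    root = ET.fromstring(ssml_text)
    return [el.get(attribute) for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == element_name]


def list_marks(ssml_text):
    """Names of all <mark> bookmarks, in document order."""
    return _attributes_of(ssml_text, "mark", "name")


def list_audio_sources(ssml_text):
    """URIs of all <audio> files, in document order."""
    return _attributes_of(ssml_text, "audio", "src")


# Hypothetical document: the text inside <audio> is spoken only if the
# audio file cannot be played.
SAMPLE = (
    '<speak version="1.0" xml:lang="en-US">'
    'Go from <mark name="here"/> here, to <mark name="there"/> there! '
    '<audio src="http://www.example.com/chime.wav">a chime</audio>'
    '</speak>'
)
```

For SAMPLE, list_marks returns the two bookmark names in the order in which the engine would reach them, which is the order in which the application can expect to be notified.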


Final remarks

SSML simplifies the use of TTS in a variety of contexts and applications. Resources for further study of SSML include the W3C specification itself and tutorials; a very good one was written by Jim Larson (Voice Browser Working Group co-chairman). SSML documents can be tested with the Loquendo TTS online interactive demo.

--------------------------------------------------------------------------------------------------------------------------------------------------
References:

If you are interested in a deeper analysis of how a speech synthesis engine is implemented, you are invited to read the following paper:

[TTS]
Silvia Quazza, Laura Donetti, Loreta Moisa, Pier Luigi Salza, "ACTOR®: A Multilingual Unit-Selection Speech Synthesis System", Proc. of 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Atholl, Scotland, 2001.
 
[IPA]
International Phonetic Association.
See http://www.arts.gla.ac.uk/ipa/ipa.html for the organization's website.
---------------------------------------------------------------------------------------------------------------------------------------------------