Author: Davide Bonardo, Paolo Baggia
(Loquendo)
Date: September 09, 2004.
The Speech Synthesis Markup Language (SSML),
the new standard way of producing content to be spoken by a speech
synthesis system, is now a W3C Recommendation (see W3C press release
and testimonials
issued on September 8th 2004 or the full specification at www.w3.org/TR/speech-synthesis/).
SSML is an XML-based markup language designed to control
a Text-To-Speech (TTS) engine in a variety of application contexts.
The practical goal of a TTS engine
is to generate spoken output from plain text or from XML-based documents
such as SSML. TTS is useful not only for helping people with visual
impairments read texts and interact with computers, but also
for the many applications that employ voice output, for
example alarms or announcements (e.g. in a railway station), interactive
telephone services, information services (weather forecasts, traffic
conditions, news...), banking, e-mail or fax reading, games, etc.
In all these very different applications, TTS generates the
spoken output corresponding to the message to be communicated. In
many cases TTS can replace the recording of prompts
by a professional speaker, a very costly activity. Moreover, a
TTS engine can be finely controlled by changing some parameters to
improve and optimize the acoustic rendering. SSML helps to specify in
detail how to utter a text, a paragraph, a sentence, a phrase or
a single word.
SSML compliance is becoming a must for TTS
engines and voice platforms, because the SSML specification is part
of a larger set of markup specifications for voice browsers, called the
W3C Speech Interface Framework, which has been finalized by the
Voice Browser Working Group (www.w3.org/voice). Although
SSML is a "new" markup language, it is already crucial
in many contexts. The Voice Extensible Markup Language (VoiceXML 2.0), which
enables Web-based development of speech applications, includes and
extends SSML for prompt messages. In multimodal
applications, SSML is mandatory both for the Speech Application
Language Tags (SALT) prompts and for the XHTML+Voice
(X+V 1.2)
language, which reuses a modularized VoiceXML embedded into XHTML
visual pages. SSML is also prescribed for prompts in the Synchronized
Multimedia Integration Language (SMIL) and in the Media
Resource Control Protocol (MRCP).
Finally, Cascading Style Sheets (CSS), employed to augment
standard visual documents (like HTML), are gaining a new CSS3
Speech Module, which is designed to produce SSML content. Using
the CSS3 Speech
Module, it will be very easy to listen to a Web page instead
of reading it.
Inside a TTS engine
Looking at the basics of TTS technology helps to understand
how the SSML reading instructions may affect the behavior of a TTS
engine. While different TTS systems may adopt different algorithms
to produce the correct output audio signal automatically, a generic
synthesis process can be described as shown in Figure 1.

Figure 1 - A generic TTS system
The Text Analysis step aims at converting
the input text into a more detailed and precise description of what
should be pronounced. The exact specification of the sounds (phonemes)
to be synthesized and their intonation requires analyzing the text,
solving its ambiguities, interpreting punctuation, symbols, acronyms,
etc. Once it is specified what should be pronounced, the Speech Synthesis
step generates the actual speech signal. Different techniques
may be applied. The most effective in yielding high-quality and
natural-sounding speech is the Unit Concatenation synthesis
technique, which relies on a database of human speech samples from
which the most suitable segments are extracted and combined to match
the input text [TTS].
SSML makes it possible to direct the Text Analysis step
as well as to specify some features of the voice used by the Speech
Synthesis step. It is useful to look more closely at the Text
Analysis step, where SSML may play a major role. To obtain a precise
description of the pronunciation of the input text, the TTS engine
should perform the following tasks:
1. Text normalization
The input text should be segmented into sentences and words,
relying on blanks and punctuation marks. Numbers, special symbols,
acronyms and abbreviations should be expanded to
normalize the text into a standard format, consisting of graphemes
(the words written in the text) and punctuation marks.
For example, the input text to be spoken: "Dr.
John Smith lives at Jefferson dr. 94, Paris, Texas."
will be normalized into: "Doctor John Smith lives at
Jefferson drive ninety-four {pause} Paris {pause} Texas {pause}".
Currencies will also be expanded; if the input text is: "The
winner will receive $1000 cash!" it will be normalized
into: "The winner will receive one thousand dollars
cash! {pause}".
2. Word pronunciation (lexical stress, phonetic transcription)
Each word must be assigned its lexical stress and should be
converted into a phonetic representation. Rules and lexicons
are designed for this purpose. The letter-to-sound (grapheme-to-phoneme)
correspondence is relatively direct for some languages, while
it can be highly unpredictable for others, like English. In
the latter case word pronunciation can be considered lexicon-dependent.
An example of a grapheme-to-phoneme transformation
is the following: "The page cannot be displayed!"
will be transcribed as a sequence of phonemes: /ðə/
/pˈeɪʤ/ /kˈænɑːt̚/
/bˈiː/ /dɪsplˈeɪd/
(the phonetic alphabet used in this example is the IPA, the
International Phonetic Alphabet defined by the International
Phonetic Association [IPA]).
In addition to the rules and lexical knowledge
implemented in the TTS engine, user-defined exceptions can usually
be added in order to improve the TTS pronunciation in the
application domain. It is possible to put a phonetic transcription
directly in the text or to use a lexicon file containing all
the transcriptions, acronyms and abbreviations (see 1).
3. Sentence structure
Intonation and rhythm, and in some cases also stress and phonetic
transcription, depend on the role of words in the sentence,
i.e. on sentence structure. What is reflected in prosody (prominence,
phrase separation, pauses, intonation) is the meaning of the
sentence, but for synthesis systems the bridge between text
and prosody is generally syntax. Syntax-based prosodic rules
are usually implemented in TTS engines in order to determine
the position of breath pauses, the relative prominence of words
and the most suitable intonation.
4. Sentence-level modifications in phonetic transcription
Phonetically transcribed words should be concatenated in the
sentence. This requires some modifications in the phonetic transcription,
accounting for changes due to continuous speech, also called
co-articulation phenomena (phoneme change, reductions, de-accentuation).
5. Computation of prosodic parameters
The analysis of the text in terms of a sequence of phonemes
is completed when their prosodic description is given. Based
on the general prosodic structure previously determined (step
3 above), the target intonation and rhythm are assigned to
the phoneme sequence and expressed either with numerical values
or with prosodic labels, depending on the adopted synthesis
technique. The assignment of prosodic values can rely on explicit rules
or be performed by an automatic learning system.
SSML directs all Text Analysis steps, providing
a standard way to control aspects of speech such as pronunciation,
acronym expansion, volume, pitch, rate, range, duration, pause,
emphasis, etc., across different synthesis-capable platforms.
Basic description of SSML language
The following simple example may be useful as a reference
for understanding the SSML syntax.
Example 1
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
<voice name="Dave">
Hello, world; my name is Dave.
</voice>
</speak>
In this example the voice named "Dave"
should pronounce the following sentence: "Hello, world;
my name is Dave."
Like other XML-based markup languages, SSML is
composed of elements. Elements can have attributes, which
are briefly described below.
The root element of SSML is <speak>; it contains the
SSML text to be spoken. The <speak> element has two required
attributes: xml:lang (which specifies the language to be used
for speaking) and version (indicating the version of the specification,
currently "1.0"). Besides these two attributes, there
are a few optional ones, such as: xmlns, which declares the
SSML namespace; xsi:schemaLocation, which
indicates the location of the XML Schema for SSML; and xml:base, which specifies
the base URI of the root document.
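As a sketch of how these attributes look together, the root element below declares the two required attributes plus the optional ones. The namespace and schema URIs are the ones published with the SSML 1.0 Recommendation; the xml:base value and the spoken text are placeholders assumed for illustration. Note that the xsi prefix has to be declared before xsi:schemaLocation can be used.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:base="http://www.example.com/prompts/">
  <!-- Relative URIs in this document resolve against xml:base -->
  Hello, world.
</speak>
```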
The following table shows which SSML elements are
associated with the five steps of Text Analysis described above;
the individual elements are described afterwards.

Table 1 - SSML elements for controlling Text Analysis step
All other elements composing the SSML language
and the text to be spoken are contained within the root <speak>
element; the <meta>, <metadata> and <lexicon>
elements must occur at the beginning of the root element, before any
other content. The <meta>
and <metadata> elements help to annotate the document (e.g.
author, date, etc.).
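A minimal sketch of document annotation with <meta> might look as follows. The seeAlso property and the http-equiv form are taken from the SSML 1.0 specification; the URL and the spoken text are placeholders assumed for illustration. Richer annotations such as author or date would more likely be placed inside <metadata>, whose content follows a metadata schema chosen by the document author.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <!-- Points to a separate resource that holds metadata about this document -->
  <meta name="seeAlso" content="http://www.example.com/my-ssml-metadata.xml"/>
  <!-- Passes HTTP-style header information to the synthesis processor -->
  <meta http-equiv="Cache-Control" content="no-cache"/>
  The text to be spoken follows the annotations.
</speak>
```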
This is an example specifying a lexicon which helps
to transform the abbreviations commonly used in SMS text messages
into a normalized and speakable form.
Example 2
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
<lexicon uri="file://localhost/share/SMSLexicon.lex"/>
<!-- This is the text of a SMS to be read using a
lexicon file. -->
I love U :-)
</speak>
The main elements for expressing the text structure
(step 3. above) are <p>, representing a paragraph, and <s>,
representing a sentence. Pronunciation is controlled by the <phoneme> element,
which provides in the ph attribute a phonemic/phonetic pronunciation
for the contained text. It is possible to choose different phonemic/phonetic
alphabets using the alphabet attribute ("ipa" for the
International Phonetic Alphabet [IPA]). Engines are expected to
support the IPA alphabet, but they can also support other pronunciation
alphabets (e.g. use "x-loquendo" for the Loquendo alphabet).
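As a minimal sketch of explicit text structure, the document below marks one paragraph containing two sentences with <p> and <s>. The sentences are invented for illustration. Explicit marking is likely to help most where punctuation alone is ambiguous, for instance when a period closes an abbreviation rather than a sentence.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here is another sentence.</s>
  </p>
</speak>
```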
Example 3
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
Typical Italian dishes are
<phoneme alphabet="ipa" ph="pˈiːt̚sə">
pizza
</phoneme>
<!-- "pizza" will be pronounced as: pˈiːt̚sə -->
and
<phoneme alphabet="ipa" ph="spəgˈet̬i">
spaghetti.
</phoneme>
<!-- "spaghetti" will be pronounced as: spəgˈet̬i -->
</speak>
To expand an abbreviation (step 1. above), the
<sub> element is used. It indicates that, for pronunciation, the contained
text is replaced by the value of the alias attribute.
Step 1 of Text Analysis showed how a TTS system
normalizes text automatically. As the knowledge contained in lexicon files
(for abbreviations, acronyms, special symbols, etc.) might not be
enough, the <sub> element can be used to spell out the
expansion of abbreviations. Suppose a TTS engine has to synthesize
the phrase: "The SSML v. 1.0 is a new standard of W3C".
The TTS output might be: "The SSML v. one point o is a new
standard of W3C". With SSML markup the synthesis improves:
Example 4
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
The <sub alias="Speech Synthesis Markup Language">SSML</sub>
<sub alias="version">v.</sub> 1.0 is a new standard of
<sub alias="World Wide Web Consortium">W3C</sub>.
</speak>
All the acronyms and the abbreviations have been expanded.
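The same element could also settle the ambiguous abbreviation from the text normalization example given earlier, where "Dr." means "Doctor" in one place and "drive" in another. The following is a sketch only; an engine may well resolve this case on its own.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <!-- Each alias states explicitly how the abbreviation should be read -->
  <sub alias="Doctor">Dr.</sub> John Smith lives at
  Jefferson <sub alias="drive">dr.</sub> 94, Paris, Texas.
</speak>
```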
In order to control prosody and style (step 5.
above) one may use the <voice> element and its attributes
(xml:lang, gender, variant, age, and name) to request the most suitable
voice, or the <emphasis> and <break> elements, respectively to
increase or decrease emphasis in a sentence and to insert or remove
pauses.
The following is an extract from Hamlet, Act I,
Scene 1. The dialog is rendered by changing voices for the different characters
on the basis of the variant attribute: "1" is the narrator;
"2" is Bernardo, an officer; "3" is Francisco,
a soldier.
The example includes some <emphasis> and
<break> elements.
Example 5
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en">
<voice gender="male" variant="1">
Elsinore. A platform before the Castle.
</voice>
<voice gender="male" variant="2">
<emphasis>Who's there?</emphasis>
</voice>
<voice gender="male" variant="3">
Nay, answer me: stand, and unfold yourself.
</voice>
<voice gender="male" variant="2">
<emphasis level="strong"> Long live the king! </emphasis>
</voice>
<voice gender="male" variant="3">
Bernardo?
</voice>
<voice gender="male" variant="2">
He.
</voice>
<voice gender="male" variant="3">
You come most carefully upon your hour.
</voice>
<voice gender="male" variant="2">
'Tis now struck twelve. Get thee to bed, Francisco.
</voice>
<voice gender="male" variant="3">
For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
</voice>
<voice gender="male" variant="2">
Have you had quiet guard?
</voice>
<voice gender="male" variant="3">
Not a mouse stirring.
</voice>
<voice gender="male" variant="2">
Well, good night.
<break/> If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
</voice>
<voice gender="male" variant="3">
I think I hear them. Stand, ho! Who is there?
</voice>
</speak>
Of course this is a joke. Current TTS is not suitable
for interpreting plays. Nevertheless, some dialogs may be fruitfully
rendered.
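The <emphasis> and <break> elements also accept attributes that make the request more precise. The sketch below reuses two lines of the extract to show the time and strength attributes of <break> and the level attribute of <emphasis>, with values defined by SSML 1.0; how strongly each one is rendered depends on the engine.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en">
  Well, good night.
  <!-- An explicit pause of half a second -->
  <break time="500ms"/>
  If you do meet Horatio and Marcellus,
  <!-- strength="none" asks the engine not to pause here -->
  <break strength="none"/>
  the rivals of my watch,
  <!-- A strong prosodic boundary, timed by the engine -->
  <break strength="strong"/>
  bid them <emphasis level="strong">make haste</emphasis>.
  <!-- level="reduced" lowers the emphasis instead of raising it -->
  <emphasis level="reduced">Not a mouse stirring.</emphasis>
</speak>
```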
With its attributes for pitch, contour, range,
rate, duration, and volume, the <prosody> element
allows closer control of intonation. The next example shows
how to use the <prosody> element and some of its attributes.
Example 6
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-GB">
Using the prosody element it is possible to
<prosody rate="x-fast">
read a sentence fast
</prosody>
or
<prosody pitch="+60Hz">
manipulate the voice pitch
</prosody>
and other important parameters.
</speak>
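The remaining <prosody> attributes can be sketched in the same way. The values below (named levels, a duration in seconds, and a contour given as pairs of position and pitch target) follow the forms defined by SSML 1.0; the sentences are invented, and the audible result will vary from engine to engine.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-GB">
  <prosody volume="loud">This sentence is spoken louder,</prosody>
  <prosody rate="slow" pitch="low">this one slowly and at a lower pitch,</prosody>
  <prosody range="x-high">this one with more pitch variation,</prosody>
  <prosody duration="3s">and this one should take about three seconds.</prosody>
  <!-- contour: pairs of (position in the text, pitch target) -->
  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
    Good morning.
  </prosody>
</speak>
```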
SSML also provides controls for the Speech Synthesis step (see
Figure 1). The <voice> element takes effect in this step,
allowing a specific voice to be chosen; the <audio> element inserts
an audio file into the synthesized speech, and the <mark> element
inserts a bookmark that can be used for control and synchronization
at the application level.
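A minimal sketch of <audio> and <mark> follows. The audio URL, the mark names and the announcement text are placeholders assumed for illustration; what the application does when a mark is reached depends on the platform.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <!-- If the file cannot be played, the contained text is spoken instead -->
  <audio src="http://www.example.com/audio/welcome.wav">
    Welcome to the train information service.
  </audio>
  <!-- The engine notifies the application when it reaches each mark -->
  <mark name="departures_start"/>
  The next train leaves at ten past nine.
  <mark name="departures_end"/>
</speak>
```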
Final remarks
SSML simplifies the use of TTS in a variety
of contexts and applications. Resources for further study of
SSML are the W3C specification
itself and tutorials, in particular the very good one written by Jim Larson
(Voice Browser Working Group co-chairman). If you
want to test SSML documents, you can use the Loquendo TTS online interactive
demo.
--------------------------------------------------------------------------------------------------------------------------------------------------
References:
If you are interested in a deeper analysis of how a speech synthesis
engine is implemented, you are invited to read the following paper:
[TTS] Silvia Quazza, Laura Donetti, Loreta Moisa, Pier Luigi Salza,
"ACTOR®: A Multilingual Unit-Selection Speech Synthesis System",
Proc. of 4th ISCA Tutorial and Research Workshop on Speech
Synthesis, Atholl, Scotland, 2001.
[IPA] International Phonetic Association.
See http://www.arts.gla.ac.uk/ipa/ipa.html
for the organization's website.