Authors: Davide Bonardo,
Paolo Baggia (Loquendo)
The Speech Synthesis Markup Language (SSML), the new standard way
of producing content to be spoken by a speech synthesis system,
is now a W3C Recommendation (see W3C
Press Release and Testimonials issued on September 8th 2004
or the full specification at www.w3.org/TR/speech-synthesis/).
SSML is an XML-based markup language, which is aimed at controlling
Text-To-Speech (TTS) for a variety of application contexts.
The functional and practical goal of a TTS engine is to generate
spoken output from textual or XML-based documents, such
as SSML. TTS is useful not only for helping people with visual
impairments read texts and interact with computers, but also
for many applications that employ voice-output technologies:
alarms or announcements (e.g. in a railway station),
interactive telephone services, information services (weather forecast,
traffic conditions, news...), banking, e-mail or fax reading, games,
etc. In all these diverse applications, TTS generates the spoken
output corresponding to the message to be communicated. In many
cases, TTS can replace prompts recorded by a professional
speaker, a lengthy and costly aspect of voice services. Moreover,
TTS can be controlled at a fine-grained level by changing certain parameters
to improve and optimize acoustic rendering. SSML helps to specify
in detail how to utter a text, a paragraph, a sentence, a phrase,
or a single word.
SSML compliance is becoming a must for TTS engines and voice
platforms, because the SSML specification is part of a larger set
of markup specifications for voice browsers, called the W3C Speech Interface
Framework, which has been finalized by the Voice
Browser Working Group. Even though SSML is a "new"
markup language, it is already crucial in many contexts.
The Voice
Extensible Markup Language 2.0 (VoiceXML 2.0), which enables
Web-based development of speech applications, includes and extends
SSML for prompt messages. In multimodal applications,
SSML is mandatory both for Speech
Application Language Tags (SALT) prompts and for the XHTML+Voice
(X+V) language, which reuses a modularized VoiceXML embedded into
XHTML visual pages. SSML is also prescribed for prompts in the Synchronized
Multimedia Integration Language (SMIL) and in the Media
Resource Control Protocol (MRCP).
Finally, Cascading Style Sheets, employed to enhance standard
visual documents (like HTML), is gaining a new CSS3
Speech Module, which is designed to produce SSML content. Using
CSS3 will make listening to a Web page, instead of reading it, very
easy indeed.
1. Inside a TTS engine
A look at the basics of TTS technology might be useful in order
to understand how the SSML reading instructions may affect the behaviour
of a TTS engine. While different TTS systems may adopt different
algorithms to automatically obtain the correct output audio signal,
a generic synthesis process can be described as shown in Figure
1.
The Text Analysis step converts the input text into a
more detailed and precise description of what should be pronounced.
The exact specification of the sounds (phonemes) to be synthesized
and their intonation requires an analysis of the text that resolves
its ambiguities and interprets punctuation, symbols, acronyms, etc.
Once the output to be pronounced has been specified, the Speech
Synthesis step generates the actual speech signal. Different
techniques may be applied. The most effective in yielding high-quality
and natural-sounding speech is the Unit Concatenation synthesis
technique, which relies on a database of human speech samples from
which the most suitable segments are extracted and combined to match
the input text [1].
SSML makes it possible both to direct the Text Analysis step and
to specify a number of features of the voice used by the Speech
Synthesis step. It is worth looking more closely at the Text
Analysis step, where SSML can play a major role. To obtain a precise
description of the pronunciation of the input text, the TTS engine
has to perform the following tasks:
- Text normalization
The input text needs to be segmented into sentences and words,
relying on blanks and punctuation marks. Numbers, special symbols,
acronyms and abbreviations have to be expanded to normalize
the text into a standard format, consisting of graphemes (the words written in
the text) and punctuation marks.
For example, the input text "Dr. John Smith
lives at Jefferson dr. 94, Paris, Texas." will be normalized
into: "Doctor John Smith lives at Jefferson drive ninety-four
{pause} Paris {pause} Texas {pause}". Currencies will also
be expanded. If the input text is "The winner will receive
$1000 cash!", it will be normalized into "The winner
will receive one thousand dollars cash! {pause}".
- Word pronunciation (lexical stress, phonetic
transcription)
Each word must be assigned its lexical stress and should be converted
into a phonetic representation. Rules and lexicons are designed
for this purpose.
The letter-to-sound (grapheme-to-phoneme) correspondence is relatively
direct for some languages, while it can be highly unpredictable
for others, like English. In the latter case, word pronunciation can
be considered lexicon-dependent.
An example of a grapheme-to-phoneme transformation is the following:
"The page cannot be displayed!" will be transcribed
as a sequence of phonemes: /ðə/ /peɪdʒ/ /ˈkænɒt/ /biː/ /dɪˈspleɪd/
(the phoneme alphabet used in this
example is the IPA, the International Phonetic Alphabet maintained by
the International Phonetic Association [2]).
In addition to the rules and lexical knowledge implemented in
the TTS engine, user-defined exceptions can usually be added in
order to improve the TTS pronunciation in the application domain.
It is possible to put a phonetic
transcription directly in the text or to use a lexicon file containing
all the transcriptions, acronyms and abbreviations (see the Text
normalization task above).
- Sentence structure
Intonation and rhythm, and in some cases also stress and phonetic
transcription, depend on the role of words in the sentence, i.e.
on sentence structure. What is reflected in prosody (prominence,
phrase separation, pauses, intonation) is the meaning of the sentence,
but for synthesis systems the bridge between text and prosody
is generally syntax. Syntax-based prosodic rules are usually implemented
in TTS engines in order to determine the position of breath pauses,
the relative prominence of words and the most suitable intonation.
- Sentence-level modifications in phonetic
transcription
Phonetically transcribed words need to be concatenated in the
sentence. This requires some modifications in the phonetic transcription,
accounting for changes due to continuous speech, also called co-articulation
phenomena (phoneme change, reductions, de-accentuation).
- Computation of prosodic parameters
The analysis of a text in terms of a sequence of phonemes is completed
when their prosodic description is given. Based on the previously
determined general prosodic structure (the Sentence structure task above), the target
intonation and rhythm
are assigned to the phoneme sequence and expressed either with
numerical values or with prosodic labels, depending on the synthesis
technique adopted. Prosodic value assignment can rely on explicit
rules or be performed
by an automatic learning system.
SSML directs all Text Analysis steps, providing a standard way
to control aspects of speech such as pronunciation, acronym expansion,
volume, pitch, rate, range, duration, pause, emphasis, etc., across
different synthesis-capable platforms.
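To make the Text normalization task more concrete, the following Python sketch reproduces the two normalization examples given above. It is only an illustration under strong assumptions: the word tables, the regular expressions and the function names (number_to_words, normalize) are hypothetical and hand-written for this article, and a real TTS engine relies on much larger lexicons and context-sensitive rules.

```python
import re

# Hypothetical, hand-written tables; a real engine uses far larger lexicons.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out a non-negative integer (toy range: 0 to 999,999)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def normalize(text):
    """Expand currencies, two abbreviations and integers; mark pauses."""
    # Currency first, so that "$1000" becomes "one thousand dollars".
    text = re.sub(r"\$(\d+)",
                  lambda m: number_to_words(int(m.group(1))) + " dollars",
                  text)
    # "Dr." before a capitalized word is read as a title ...
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # ... any other "dr." is assumed to be a street name.
    text = re.sub(r"\b[Dd]r\.", "drive", text)
    # Remaining integers are spelled out.
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    # Punctuation marks become explicit pause markers.
    text = re.sub(r"\s*[,.!?;:]+\s*", " {pause} ", text)
    return text.strip()


print(normalize("Dr. John Smith lives at Jefferson dr. 94, Paris, Texas."))
print(normalize("The winner will receive $1000 cash!"))
```

The order of the rules matters in this sketch: the currency rule has to run before the generic number rule, and the "title" reading of "Dr." before the "street" reading. This is the kind of ambiguity that SSML markup lets an author resolve explicitly.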
2. Basic description of SSML language
The following simple and explicit example may prove useful as a
reference for understanding SSML syntax.
Example 1
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <voice name="Dave">
    Hello, world; my name is Dave.
  </voice>
</speak>
In this example the voice named "Dave"
is to pronounce the following sentence: "Hello, world; my name
is Dave." Like other XML-based markup languages, SSML is composed
of elements. Elements can have attributes, which are
briefly described below.
The root element of the SSML language is <speak>;
it contains the SSML text to be spoken. The <speak>
element has two required attributes: xml:lang
(which specifies the language to be used for speaking) and
version (indicating the version of the specification, currently
"1.0"). Besides these two attributes, there are a few
optional ones, such as: xmlns, used to
declare the SSML namespace; xsi:schemaLocation,
which indicates the location of the XML Schema for SSML; and xml:base,
which specifies the base URI of the root document.
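As an illustration of the two required attributes, the following Python sketch builds a minimal <speak> document with the standard xml.etree.ElementTree module and checks that version and xml:lang are present. The helper names make_speak and missing_required are hypothetical, and, like the examples in this article, the sketch leaves out the optional xmlns declaration.

```python
import xml.etree.ElementTree as ET

# Name under which ElementTree stores the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_speak(text, lang="en-US"):
    """Build a minimal SSML document carrying the two required attributes."""
    root = ET.Element("speak", {"version": "1.0", XML_LANG: lang})
    root.text = text
    return ET.tostring(root, encoding="unicode")


def missing_required(ssml):
    """Return the names of the required <speak> attributes that are absent."""
    root = ET.fromstring(ssml)
    missing = []
    if root.get("version") is None:
        missing.append("version")
    if root.get(XML_LANG) is None:
        missing.append("xml:lang")
    return missing


doc = make_speak("Hello, world; my name is Dave.")
print(doc)
print(missing_required(doc))
print(missing_required("<speak>Hello</speak>"))
```

A check of this kind is only a small part of what a validating processor would do; validation against the XML Schema for SSML is outside the scope of the sketch.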
The following table shows which SSML elements are associated with
the five Text Analysis tasks described above; the individual elements
are described below.
| Step | Description | SSML elements |
| 1. | Text Normalization | <sub>, <say-as>, <p>, <s>, <lexicon> |
| 2. | Word Pronunciations: lexical stress and phonetic transcription | <phoneme>, <lexicon>, xml:lang attribute |
| 3. | Sentence Structure | <break> |
| 4. | Sentence-level modification in phonetic transcription | <emphasis>, <break> |
| 5. | Computation of prosodic parameters | <prosody> |
Table 1 - SSML elements for controlling the Text Analysis step
All other elements composing the SSML language and
the text to be spoken are contained within the root <speak>
element; the <meta>, <metadata>
and <lexicon> elements, when present, must occur
immediately after the opening tag of the root element, before any other content. The <meta>
and <metadata> elements help to
annotate the document (e.g. author, date, etc.).
This is an example specifying a lexicon which helps to transform
the abbreviations commonly used in SMS text messages into a normalized
and speakable form.
Example 2
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  <lexicon uri="file://localhost/share/SMSLexicon.lex"/>
  <!-- This is the text of an SMS to be read using a lexicon file. -->
  I love U :-)
</speak>
The main elements for expressing the text structure (step 3. above)
are <p>, representing a paragraph,
and <s>, representing a sentence.
Another important element is <phoneme>,
which provides, in its ph attribute, a phonemic/phonetic pronunciation
for the contained text. Different phonemic/phonetic
alphabets can be selected using the alphabet attribute ("ipa" for the International
Phonetic Alphabet, see [2]). The IPA alphabet
is mandatory, but individual engines can support other pronunciation
alphabets (e.g. "x-loquendo"
for the Loquendo alphabet).
Example 3
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  Typical Italian dishes are
  <phoneme alphabet="ipa" ph="pˈiːt̚sə">
    pizza
  </phoneme>
  <!-- "pizza" will be pronounced as: pˈiːt̚sə -->
  and
  <phoneme alphabet="ipa" ph="spəgˈet̬i">
    spaghetti.
  </phoneme>
  <!-- "spaghetti" will be pronounced as: spəgˈet̬i -->
</speak>
To expand an abbreviation (step 1. above), the <sub>
element can be used. It indicates that, for pronunciation, the contained text
is replaced by the value of the alias attribute.
Step 1 of Text Analysis showed how a TTS system normalizes text automatically.
As the knowledge contained in lexicon files (for abbreviations, acronyms,
special symbols, etc.) might not be enough, it is possible to use the
<sub> element to spell out the expansion of acronyms
and abbreviations explicitly. Suppose a TTS engine is asked to synthesize the phrase
"The SSML v. 1.0 is a new standard of W3C": the TTS output
may be "The SSML v. one point o is a new standard of W3C".
With SSML markup the synthesis improves:
Example 4
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  The <sub alias="Speech Synthesis Markup Language">SSML</sub>
  <sub alias="version">v.</sub>
  1.0 is a new standard of
  <sub alias="World Wide Web Consortium">W3C</sub>.
</speak>
All the acronyms and abbreviations are now expanded. In order
to control prosody and style (step 5. above) one may use the <voice>
element and its attributes (xml:lang, gender, variant,
age, and name) to request the most suitable voice, or the <emphasis>
and <break> elements to increase or decrease emphasis in a sentence
and to insert or remove pauses, respectively.
The following is an extract from Hamlet, Act I, Scene 1. The
dialog is rendered by changing voices for the different characters on the
basis of the variant attribute: "1" is the narrator; "2"
is Bernardo, an officer; "3" is Francisco, a soldier.
The example includes some <emphasis> and
<break> elements.
Example 5
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en">
<voice gender="male" variant="1">
Elsinore. A platform before the Castle.
</voice>
<voice gender="male" variant="2">
<emphasis>Who's there?</emphasis>
</voice>
<voice gender="male" variant="3">
Nay, answer me: stand, and unfold yourself.
</voice>
<voice gender="male" variant="2">
<emphasis level="strong"> Long live the king!
</emphasis>
</voice>
<voice gender="male" variant="3">
Bernardo?
</voice>
<voice gender="male" variant="2">
He.
</voice>
<voice gender="male" variant="3">
You come most carefully upon your hour.
</voice>
<voice gender="male" variant="2">
'Tis now struck twelve. Get thee to bed, Francisco.
</voice>
<voice gender="male" variant="3">
For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
</voice>
<voice gender="male" variant="2">
Have you had quiet guard?
</voice>
<voice gender="male" variant="3">
Not a mouse stirring.
</voice>
<voice gender="male" variant="2">
Well, good night.
<break/> If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
</voice>
<voice gender="male" variant="3">
I think I hear them. Stand, ho! Who is there?
</voice>
</speak>
This is, of course, a joke: currently available TTS is not suitable
for interpreting plays. Nevertheless, some dialogs may be fruitfully
rendered. With its attributes for defining pitch, contour, range, rate,
duration, and volume, the <prosody>
element can be helpful for closer control of intonation. The next
example shows how to use the <prosody>
element and its attributes.
Example 6
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-GB">
  Using the prosody element it is possible to
  <prosody rate="x-fast">
    read a sentence fast
  </prosody>
  or
  <prosody pitch="+60Hz">
    manipulate the voice pitch
  </prosody>
  and other important parameters.
</speak>
SSML also provides controls for the Speech Synthesis step (see
Figure 1). The <voice> element takes
effect in this step, allowing a specific voice to be chosen; the <audio>
element inserts an audio file into the synthesized output, and the <mark>
element inserts a bookmark that can be used for control and synchronization
at the application level.
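As a recap of how markup changes what is finally spoken, the following Python sketch walks an SSML document and computes the word sequence handed on for synthesis. It is a deliberately small illustration, not a description of any real SSML processor: the function names are hypothetical, only <sub> and <break> receive special treatment, and every other element simply contributes its text. The {pause} marker follows the notation used in Section 1.

```python
import xml.etree.ElementTree as ET


def _collect(node, parts):
    """Append to `parts` the text to be spoken for `node` and its children."""
    if node.tag == "sub":
        # The alias attribute replaces the contained text.
        parts.append(node.get("alias", ""))
        return
    if node.tag == "break":
        parts.append(" {pause} ")
        return
    if node.text:
        parts.append(node.text)
    for child in node:
        _collect(child, parts)
        if child.tail:
            # Text that follows the child inside the current element.
            parts.append(child.tail)


def spoken_text(ssml):
    """Return the whitespace-normalized text to be spoken for a document."""
    parts = []
    _collect(ET.fromstring(ssml), parts)
    return " ".join("".join(parts).split())


EXAMPLE = """<speak version="1.0" xml:lang="en-US">
  The <sub alias="Speech Synthesis Markup Language">SSML</sub>
  <sub alias="version">v.</sub> 1.0 is a new standard of
  <sub alias="World Wide Web Consortium">W3C</sub>.
</speak>"""

print(spoken_text(EXAMPLE))
```

The result still contains "1.0", which the engine's own text normalization has to expand: SSML complements the automatic Text Analysis steps and does not replace them.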
3. Final remarks
SSML simplifies the use of TTS in a variety of contexts and applications.
Resources for further study of SSML include the W3C
specification document itself and tutorials, among which is a very
good one written by Jim
Larson, co-chairman of the Voice
Browser Working Group.
If you would like to test SSML documents, please use the Loquendo
TTS online interactive demo.
4. References
If you are interested in a deeper analysis of how a speech synthesis
engine is implemented, we invite you to read the following paper:
[1] Silvia Quazza, Laura Donetti, Loreta Moisa, Pier Luigi Salza, "ACTOR®: A Multilingual Unit-Selection Speech Synthesis System", Proc. of 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Atholl, Scotland, 2001. See: http://www.loquendo.com/en/brochure/art_TTS_2001.pdf
[2] International Phonetic Association. See http://www.arts.gla.ac.uk/ipa/ipa.html for the organization's website.