Speech Synthesis Markup Language (SSML) – Microsoft Azure

The Talker WordPress plugin supports Speech Synthesis Markup Language (SSML). SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C.

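You can fine-tune the following aspects of the generated speech: voices, pauses, muted sections, the interpretation of numbers and dates, substitutions, emphasis, and prosody. Each is described in its own section below.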

SSML is applied by using shortcodes inside the WordPress page. A shortcode is a WordPress-specific code that lets you do nifty things with very little effort. In just one line, a shortcode can embed files or create objects that would normally require lots of complicated, ugly code.
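As a rough illustration of how a shortcode relates to the SSML element it stands for, the Python sketch below rewrites a few shortcodes into SSML-style tags. It is not the plugin's implementation: the function name shortcodes_to_ssml is hypothetical, and so is the assumption that the shortcode name and attributes carry over to the SSML element unchanged.

```python
import re

# Assumption: these shortcodes correspond to SSML elements of the same name.
SSML_ELEMENTS = {"voice", "say-as", "sub", "emphasis", "prosody"}
SELF_CLOSING = {"break"}

_SHORTCODE_RE = re.compile(r"\[(/?)talker-([a-z-]+)([^\]]*)\]")


def shortcodes_to_ssml(text: str) -> str:
    """Rewrite known talker-* shortcodes as SSML-style tags (hypothetical helper)."""

    def replace(match: re.Match) -> str:
        closing, name, attrs = match.groups()
        if name in SELF_CLOSING:
            return f"<{name}{attrs.rstrip()}/>"
        if name in SSML_ELEMENTS:
            return f"</{name}>" if closing else f"<{name}{attrs.rstrip()}>"
        # Shortcodes with no SSML counterpart (e.g. talker-mute) are left as-is.
        return match.group(0)

    return _SHORTCODE_RE.sub(replace, text)
```

For example, shortcodes_to_ssml('[talker-break time="2s"]') returns '<break time="2s"/>'.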

Multiple voices

You can use multiple voices on one page, and you can voice different fragments of the page in different languages. Use the talker-voice shortcode to speak the text with the specified voice. Each listed voice has its own individual character. The talker-voice shortcode can be used in combination with all other SSML shortcodes.

NOTES

Make sure that there are no spaces between characters in the voice name.

[talker-voice name="en-US-AmberNeural"] Hello! [/talker-voice]
[talker-voice name="uk-UA-PolinaNeural"] Привіт! [/talker-voice]

You can find the list of all available voices on the Microsoft website.

HTML tag attribute

talker-voice can also be used as an HTML tag attribute. Its value is a voice name, with the same rules as for the shortcode above.

<span talker-voice="en-US-AmberNeural"> Hello! </span>

Pause

To create a pause in speech synthesis, use the following shortcode:

[talker-break time="2s"]

The time attribute sets the length of the break in seconds or milliseconds (e.g. “3s” or “250ms”).
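To make the two units concrete, here is a minimal Python sketch that converts a break value into milliseconds. The helper name parse_break_time is hypothetical, and the accepted syntax (a number followed by "s" or "ms") is an assumption based on the two examples above, not a description of how the plugin parses the value.

```python
import re

# Assumption: a value is a non-negative number followed by "s" or "ms".
_TIME_RE = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")


def parse_break_time(value: str) -> int:
    """Return the pause length in milliseconds for values such as "3s" or "250ms"."""
    match = _TIME_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported break time: {value!r}")
    amount, unit = float(match.group(1)), match.group(2)
    return round(amount * (1000 if unit == "s" else 1))
```

With this sketch, "2s" and "2000ms" describe the same pause.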

Mute

You can exclude part of the page from the audio. Wrap it in the talker-mute shortcode to prevent this section of the page from being spoken.

[talker-mute] Text to be removed from the audio [/talker-mute]

There are also several other ways to mute a piece of text in a post:

  • Add class="talker-mute" to the muted element
  • Add the attribute talker-mute="true" to the muted element
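The effect of the class and attribute markers can be pictured with a short Python sketch that collects only the text outside muted elements. It uses the standard library's html.parser. The names MutedTextFilter and spoken_text are hypothetical, and the sketch assumes that muting an element also mutes everything nested inside it, which the text above does not state explicitly.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = frozenset({"br", "hr", "img", "input", "meta", "link"})


class MutedTextFilter(HTMLParser):
    """Collect text that is not inside an element marked as muted."""

    def __init__(self) -> None:
        super().__init__()
        self.spoken = []
        self._stack = []  # one flag per open element: True if that element is muted

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        muted = "talker-mute" in classes or attr_map.get("talker-mute") == "true"
        self._stack.append(muted)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Keep the text only if no enclosing element is muted.
        if not any(self._stack) and data.strip():
            self.spoken.append(data.strip())


def spoken_text(html: str) -> str:
    parser = MutedTextFilter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.spoken)
```

For instance, spoken_text('<p>Intro</p><p class="talker-mute">Ad</p><p>End</p>') returns "Intro End".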

Say as

This group of shortcodes lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.

Format matching is important for shortcodes in the talker-say-as group. For example, a shortcode for numbers (such as cardinal) will not work if even one non-numeric character appears between the opening and closing shortcode.

The talker-say-as shortcode has one required attribute, interpret-as, which determines how the value is spoken.
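The format-matching rule for number shortcodes can be checked before publishing with a small Python sketch. The helper name is_plain_number is hypothetical. It assumes that spaces around the number are tolerated (the cardinal example in this article has them) and that only ASCII digits count as numeric characters.

```python
def is_plain_number(content: str) -> bool:
    """Return True when the shortcode content is digits only, ignoring outer spaces."""
    stripped = content.strip()
    return stripped.isascii() and stripped.isdigit()
```

Under these assumptions, " 12345 " passes the check, while "12 apples" does not and would need the text moved outside the shortcode.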

Say as cardinal

The following example is spoken as “Twelve thousand three hundred forty-five” (for US English) or “Twelve thousand three hundred and forty-five” (for UK English):

[talker-say-as interpret-as="cardinal"] 12345 [/talker-say-as]
The interpret-as attribute accepts the values listed below. Each entry gives the interpret-as value, the accepted format values, and the interpretation.

interpret-as: characters, spell-out
format: None
Interpretation: The text is spoken as individual letters (spelled out). The speech synthesis engine pronounces:

[talker-say-as interpret-as="characters"] test [/talker-say-as]

As “T E S T.”

interpret-as: cardinal, number
format: None
Interpretation: The text is spoken as a cardinal number. The speech synthesis engine pronounces:

There are [talker-say-as interpret-as="cardinal"]10[/talker-say-as] options

As “There are ten options.”

interpret-as: ordinal
format: None
Interpretation: The text is spoken as an ordinal number. The speech synthesis engine pronounces:

Select the [talker-say-as interpret-as="ordinal"]3rd[/talker-say-as] option

As “Select the third option.”

interpret-as: number_digit
format: None
Interpretation: The text is spoken as a sequence of individual digits. The speech synthesis engine pronounces:

[talker-say-as interpret-as="number_digit"]123456789[/talker-say-as]

As “1 2 3 4 5 6 7 8 9.”

interpret-as: fraction
format: None
Interpretation: The text is spoken as a fractional number. The speech synthesis engine pronounces:

[talker-say-as interpret-as="fraction"]3/8[/talker-say-as] of an inch

As “three eighths of an inch.”

interpret-as: date
format: dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y
Interpretation: The text is spoken as a date. The format attribute specifies the date’s format (d=day, m=month, and y=year). The speech synthesis engine pronounces:

Today is [talker-say-as interpret-as="date" format="mdy"]10-19-2016[/talker-say-as]

As “Today is October nineteenth two thousand sixteen.”

interpret-as: time
format: hms12, hms24
Interpretation: The text is spoken as a time. The format attribute specifies whether the time is specified by using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds. Here are some valid time examples: 12:35, 1:14:32, 08:15, and 02:50:45. The speech synthesis engine pronounces:

The train departs at [talker-say-as interpret-as="time" format="hms12"]4:00am[/talker-say-as]

As “The train departs at four A M.”

interpret-as: duration
format: hms, hm, ms
Interpretation: The text is spoken as a duration. The format attribute specifies the duration’s format (h=hour, m=minute, and s=second). The speech synthesis engine pronounces:

[talker-say-as interpret-as="duration"]01:18:30[/talker-say-as]

As “one hour eighteen minutes and thirty seconds.” It pronounces:

[talker-say-as interpret-as="duration" format="ms"]01:18[/talker-say-as]

As “one minute and eighteen seconds.” This interpret-as value is supported only for English and Spanish.

interpret-as: telephone
format: None (optionally, country code digits)
Interpretation: The text is spoken as a telephone number. The format attribute can contain digits that represent a country code. Examples are “1” for the United States or “39” for Italy. The speech synthesis engine can use this information to guide its pronunciation of a phone number. The phone number might also include the country code, and if so, it takes precedence over the country code in the format attribute. The speech synthesis engine pronounces:

The number is [talker-say-as interpret-as="telephone" format="1"](888) 555-1212[/talker-say-as]

As “The number is area code eight eight eight five five five one two one two.”

interpret-as: currency
format: None
Interpretation: The text is spoken as a currency. The speech synthesis engine pronounces:

[talker-say-as interpret-as="currency"]99.9 USD[/talker-say-as]

As “ninety-nine US dollars and ninety cents.”

interpret-as: address
format: None
Interpretation: The text is spoken as an address. The speech synthesis engine pronounces:

I'm at [talker-say-as interpret-as="address"]150th CT NE, Redmond, WA[/talker-say-as]

As “I’m at 150th Court Northeast Redmond Washington.”

interpret-as: name
format: None
Interpretation: The text is spoken as a person’s name. The speech synthesis engine pronounces:

[talker-say-as interpret-as="name"]ED[/talker-say-as]

As [æd]. In Chinese names, some characters are pronounced differently when they appear in a family name. For example, the speech synthesis engine says 仇 in

[talker-say-as interpret-as="name"]仇先生[/talker-say-as]

As [qiú] instead of [chóu].
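The way the duration format letters line up with the colon-separated numbers can be sketched in Python. The helper name parse_duration is hypothetical, and treating hms as the default when no format is given is an assumption drawn from the 01:18:30 example above, not from the plugin's code.

```python
SUPPORTED_FORMATS = ("hms", "hm", "ms")


def parse_duration(value: str, fmt: str = "hms") -> dict:
    """Pair each colon-separated number with its unit letter (h, m, or s)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported duration format: {fmt!r}")
    parts = value.strip().split(":")
    if len(parts) != len(fmt):
        raise ValueError(f"{value!r} does not match format {fmt!r}")
    return {unit: int(part) for unit, part in zip(fmt, parts)}
```

Here parse_duration("01:18", "ms") reads the value as one minute and eighteen seconds, while parse_duration("01:18", "hm") reads the same text as one hour and eighteen minutes, which is why the format attribute matters for two-part values.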

Substitution

The talker-sub shortcode indicates that the text in the alias attribute value replaces the contained text for pronunciation.

The following example is spoken as “World Wide Web Consortium” instead of “W3C”:

[talker-sub alias="World Wide Web Consortium"] W3C [/talker-sub]

You can also use the talker-sub shortcode to provide a simplified pronunciation of a difficult-to-read word.

Emphasis

Use the [talker-emphasis] shortcode to add or remove emphasis from the text contained by the element.

NOTES

The talker-emphasis shortcode should only be used around a full sentence. Enclosing words within a sentence may cause unwanted pauses in speech.

The following example uses the talker-emphasis shortcode to make an announcement:

[talker-emphasis level="moderate"] This is an important announcement [/talker-emphasis]

This shortcode supports an optional “level” attribute with the following valid values:

  • reduced
  • none
  • moderate
  • strong

When the level attribute isn’t specified, the default level is moderate.

Audio file URL

Use the [talker-file] shortcode to display the URL of the audio file for the current post or page.

Prosody

Use the [talker-prosody] shortcode to customize the pitch, speaking rate, and volume of text contained by the element.

[talker-prosody rate="slow" pitch="-2st"] Can you hear me now? [/talker-prosody]
The shortcode accepts the following attributes.

Attribute: pitch
Required or optional: Optional
Description: Indicates the baseline pitch for the text. Pitch changes can be applied at the sentence level. The pitch changes should be within 0.5 to 1.5 times the original audio. You can express the pitch as:

  • An absolute value: Expressed as a number followed by “Hz” (Hertz). For example, [talker-prosody pitch="600Hz"]some text[/talker-prosody].
  • A relative value:
      • As a relative number: Expressed as a number preceded by “+” or “-” and followed by “Hz” or “st” that specifies an amount to change the pitch. For example: [talker-prosody pitch="+80Hz"]some text[/talker-prosody] or [talker-prosody pitch="-2st"]some text[/talker-prosody]. The “st” indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.
      • As a percentage: Expressed as a number preceded by “+” (optionally) or “-” and followed by “%”, indicating the relative change. For example: [talker-prosody pitch="50%"]some text[/talker-prosody] or [talker-prosody pitch="-50%"]some text[/talker-prosody].
  • A constant value: x-low, low, medium, high, x-high, default

Attribute: rate
Required or optional: Optional
Description: Indicates the speaking rate of the text. Speaking rate can be applied at the word or sentence level. The rate changes should be within 0.5 to 2 times the original audio. You can express the rate as:

  • A relative value:
      • As a relative number: Expressed as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the original rate. A value of 0.5 results in a halving of the original rate. A value of 2 results in twice the original rate.
      • As a percentage: Expressed as a number preceded by “+” (optionally) or “-” and followed by “%”, indicating the relative change. For example: [talker-prosody rate="50%"]some text[/talker-prosody] or [talker-prosody rate="-50%"]some text[/talker-prosody].
  • A constant value: x-slow, slow, medium, fast, x-fast, default

Attribute: volume
Required or optional: Optional
Description: Indicates the volume level of the speaking voice. Volume changes can be applied at the sentence level. You can express the volume as:

  • An absolute value: Expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. An example is 75. The default is 100.0.
  • A relative value:
      • As a relative number: Expressed as a number preceded by “+” or “-” that specifies an amount to change the volume. Examples are +10 or -5.5.
      • As a percentage: Expressed as a number preceded by “+” (optionally) or “-” and followed by “%”, indicating the relative change. For example: [talker-prosody volume="50%"]some text[/talker-prosody] or [talker-prosody volume="+3%"]some text[/talker-prosody].
  • A constant value: silent, x-soft, soft, medium, loud, x-loud, default
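To get a feel for whether a relative pitch value stays inside the 0.5 to 1.5 range mentioned above, the following Python sketch converts semitone and percentage values into a multiplier of the original pitch. The function names are hypothetical. The semitone conversion assumes twelve equal semitones per octave (a ratio of 2^(n/12) for n semitones), which is a common convention but is not confirmed by this article. Relative values in Hz are left out because they depend on the baseline pitch of the voice.

```python
def relative_pitch_ratio(value: str) -> float:
    """Convert a relative pitch such as "-2st" or "+50%" into a multiplier."""
    value = value.strip()
    if value.endswith("st"):
        # Assumption: 12 equal semitones per octave, so n semitones = 2 ** (n / 12).
        return 2.0 ** (float(value[:-2]) / 12.0)
    if value.endswith("%"):
        return 1.0 + float(value[:-1]) / 100.0
    raise ValueError(f"unsupported relative pitch: {value!r}")


def is_within_pitch_range(ratio: float) -> bool:
    """Check the 0.5 to 1.5 range given for pitch changes."""
    return 0.5 <= ratio <= 1.5
```

Under these assumptions, pitch="-2st" lowers the pitch to roughly 0.89 times the original and stays in range, while a full octave up ("+12st", a ratio of 2.0) falls outside it.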

Say hidden text

Use the [talker-say] shortcode to voice text without displaying it on the front end of the page. The text is shown in the page editor and voiced by the plugin, but it is hidden from visitors.

[talker-say] This text is converted to audio but not displayed to users [/talker-say]
