Speech Synthesis Markup Language (SSML) in the Speaker WordPress plugin

Starting with version 2.0.0, the Speaker WordPress plugin supports Speech Synthesis Markup Language (SSML). SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C.

For all the functions and plugin shortcodes listed below to work correctly, the Speaker Utilities plugin must be enabled as well. It is installed automatically with the Speaker plugin.

Notes

This manual was written for plugin version 2.0.0 and higher. If you use an older version of the plugin, use the Speaker settings manual for versions 1.0.^.

You can fine-tune for the following entities:

SSML is applied by using shortcodes inside the WordPress page. A shortcode is a WordPress-specific code that lets you do nifty things with very little effort. Shortcodes can embed files or create objects that would normally require lots of complicated, ugly code in just one line.

Multiple voices

You can use multiple voices on one page, and you can also voice different fragments of the page in different languages. Use the speaker-voice shortcode to speak the text with the specified voice. Each listed voice has its own individual character. The speaker-voice shortcode can be used in combination with all other SSML shortcodes.

Notes

Make sure that there are no spaces between characters in the voice name.

You can find out the name of the voice that you need in the settings:

Voice name in the plugin settings

The following example gives a greeting in a male voice in British English, then in a female voice in French, and then in a male voice in Hindi:

[speaker-voice name="en-GB-Wavenet-A"] Hello! [/speaker-voice]
[speaker-voice name="fr-FR-Wavenet-B"] Bonjour! [/speaker-voice]
[speaker-voice name="hi-IN-Standard-C"] नमस्कार! [/speaker-voice]
This example is spoken as “Hello! Bonjour! नमस्कार!” in different voices and different languages.

You can also find the name of a voice and listen to samples of all the voices on the Google Cloud website.

HTML tag attribute

Starting from version 3.4.0, voices can be used as an HTML tag attribute with the parameters listed above.

Example of using a voice as a tag attribute:

<span speaker-voice="hi-IN-Standard-C"> नमस्कार! </span>

Pause

To create a pause in speech synthesis, use the following shortcode:

[speaker-break time="2s" strength="medium"]
This example is spoken as “One… Two”

The time attribute sets the length of the break in seconds or milliseconds (e.g. “3s” or “250ms”).

The strength attribute sets the strength of the output’s prosodic break in relative terms. Valid values are “none”, “x-weak”, “weak”, “medium”, “strong”, and “x-strong”. The value “none” indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break that the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses.

[speaker-break time="2s"]

You can specify only the length of the pause. In this case, the strength parameter is set to its default value.
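The time and strength rules above can be sketched in Python. This is only an illustration, not the plugin's code: the helper name build_break is hypothetical, and the idea that the shortcode ends up as a standard SSML break element is an assumption based on the SSML specification rather than on anything documented here.

```python
import re
from xml.sax.saxutils import quoteattr

# Strength values named in the text above, weakest to strongest.
BREAK_STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")

# A time is a non-negative number followed by "s" or "ms", e.g. "3s" or "250ms".
_TIME_PATTERN = re.compile(r"^\d+(\.\d+)?(s|ms)$")


def build_break(time=None, strength=None):
    """Return an SSML <break/> element (hypothetical helper, assumed mapping)."""
    attrs = []
    if time is not None:
        if not _TIME_PATTERN.match(time):
            raise ValueError(f"invalid break time: {time!r}")
        attrs.append(f"time={quoteattr(time)}")
    if strength is not None:
        if strength not in BREAK_STRENGTHS:
            raise ValueError(f"invalid break strength: {strength!r}")
        attrs.append(f"strength={quoteattr(strength)}")
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print(build_break(time="2s", strength="medium"))
print(build_break(time="250ms"))
```

Validating both attributes before building the element mirrors the advice above: a malformed time or an unknown strength is rejected instead of being passed on silently.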

Mute

You can exclude part of the page from being spoken. Wrap it in the speaker-mute shortcode to prevent this section of the page from being spoken.

[speaker-mute] ... [/speaker-mute]

There are other ways to prevent a page fragment from being voiced. Learn more in the “Mute part of the text” article.

Since plugin version 3.0, the shortcode accepts a new tag attribute, which sets the HTML tag that is added around the text you want to mute.

To mute a part of the text such as a paragraph or block, use one of these shortcodes:

[speaker-mute] ... [/speaker-mute]

or

[speaker-mute tag="div"] ... [/speaker-mute]

To mute a short part of the text, use the shortcode:

[speaker-mute tag="span"] ... [/speaker-mute]

To mute a section, use the shortcode:

[speaker-mute tag="section"] ... [/speaker-mute]

Do not use block elements like <div></div> or <section></section> inside <span></span>.

Say as

This group of shortcodes lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.

Format matching is important for shortcodes in the speaker-say-as group. For example, a shortcode for numbers (such as cardinal) will not work if there is at least one non-numeric character between the opening and closing shortcode.

The speaker-say-as shortcode has a required attribute, interpret-as, which determines how the value is spoken. The optional attributes format and detail can be used depending on the particular interpret-as value.

Say as cardinal

The following example is spoken as “Twelve thousand three hundred forty-five” (for US English) or “Twelve thousand three hundred and forty-five” (for UK English):

[speaker-say-as interpret-as="cardinal"] 12345 [/speaker-say-as]
This example is spoken as “Twelve thousand three hundred forty-five”

Say as ordinal

The following example is spoken as “First”:

[speaker-say-as interpret-as="ordinal"] 1 [/speaker-say-as]
This example is spoken as “First”

Say as characters

The following example is spoken as “C A N”:

[speaker-say-as interpret-as="characters"] can [/speaker-say-as]
This example is spoken as “C A N”

Say as fraction

The following example is spoken as “five and a half”:

[speaker-say-as interpret-as="fraction"] 5+1/2 [/speaker-say-as]
This example is spoken as “five and a half”

Say as expletive or bleep

The following example comes out as a beep, as though it has been censored:

Please [speaker-say-as interpret-as="expletive"] censor this [/speaker-say-as] text.
This example bleeps out a piece of text

Say as unit

Converts units to singular or plural depending on the number. The following example is spoken as “10 feet”:

[speaker-say-as interpret-as="unit"] 10 foot [/speaker-say-as]
This example is spoken as “10 feet”

Say as verbatim or spell-out

The following example is spelled out letter by letter:

[speaker-say-as interpret-as="verbatim"] abcdefg [/speaker-say-as]
This example is spelled out letter by letter

Say as date

The format attribute is a sequence of date field character codes. Supported field character codes in format are {y, m, d} for a year, month, and day (of the month) respectively. If the field code appears once for a year, month, or day then the number of digits expected are 4, 2, and 2 respectively. If the field code is repeated then the number of expected digits is the number of times the code is repeated. Fields in the date text may be separated by punctuation and/or spaces.
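The digit-count rule for the format attribute can be sketched in Python as follows. The function name expected_date_fields is hypothetical, and the code only restates the rule from the paragraph above; it does not claim to reproduce how the plugin or the speech processor actually parses dates.

```python
from itertools import groupby

# Digits expected when a field code appears exactly once in the format.
DEFAULT_DIGITS = {"y": 4, "m": 2, "d": 2}


def expected_date_fields(fmt):
    """Return (field code, expected digit count) pairs for a date format."""
    fields = []
    for code, run in groupby(fmt):
        if code not in DEFAULT_DIGITS:
            raise ValueError(f"unsupported date field code: {code!r}")
        repeats = len(list(run))
        # A single code uses the default width; a repeated code
        # expects as many digits as there are repetitions.
        digits = DEFAULT_DIGITS[code] if repeats == 1 else repeats
        fields.append((code, digits))
    return fields


print(expected_date_fields("yyyymmdd"))  # [('y', 4), ('m', 2), ('d', 2)]
print(expected_date_fields("dm"))        # [('d', 2), ('m', 2)]
```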

The detail attribute controls the spoken form of the date. For detail='1', only the day field and one of the month or year fields are required, although both may be supplied. This is the default when fewer than all three fields are given. The spoken form is “The {ordinal day} of {month}, {year}”.

The following example is spoken as “The tenth of September, nineteen sixty”:

[speaker-say-as interpret-as="date" format="yyyymmdd" detail="1"] 1960-09-10 [/speaker-say-as]
This example is spoken as “The tenth of September, nineteen sixty”

The following example is spoken as “The tenth of September”:

[speaker-say-as interpret-as="date" format="dm"] 10-9 [/speaker-say-as]
This example is spoken as “The tenth of September”

For detail='2' the day, month, and year fields are required and this is the default when all three fields are supplied. The spoken form is “{month} {ordinal day}, {year}”.

The following example is spoken as “September tenth, nineteen sixty”:

[speaker-say-as interpret-as="date" format="dmy" detail="2"] 10-9-1960 [/speaker-say-as]
This example is spoken as “September tenth, nineteen sixty”

Say as time

The format attribute is a sequence of time field character codes. Supported field character codes in format are {h, m, s, Z, 12, 24} for an hour, minute (of the hour), second (of the minute), time zone, 12-hour time, and 24-hour time respectively. If the field code appears once for an hour, minute, or second then the number of digits expected are 1, 2, and 2 respectively. If the field code is repeated then the number of expected digits is the number of times the code is repeated. Fields in the time text may be separated by punctuation and/or spaces. If hour, minute, or second are not specified in the format or there are no matching digits then the field is treated as a zero value. The default format is “hms12”.

The detail attribute controls whether the spoken form of the time is 12-hour time or 24-hour time. The spoken form is 24-hour time if detail='1' or if detail is omitted and the format of the time is 24-hour time. The spoken form is 12-hour time if detail='2' or if detail is omitted and the format of the time is 12-hour time.
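The choice between the 12-hour and 24-hour spoken form can be written out as a small decision function. This Python sketch is illustrative only: spoken_clock is a hypothetical name, and treating a format without a “12” or “24” code as 12-hour time is an assumption drawn from the default format “hms12”.

```python
def spoken_clock(fmt="hms12", detail=None):
    """Return 12 or 24, the clock used for the spoken form of a time."""
    if detail == "1":
        return 24  # detail='1' always selects 24-hour speech
    if detail == "2":
        return 12  # detail='2' always selects 12-hour speech
    if detail is not None:
        raise ValueError(f"unsupported detail value: {detail!r}")
    # detail omitted: follow the format. A format with no "24" code
    # is treated as 12-hour time (assumption based on the default "hms12").
    return 24 if "24" in fmt else 12


print(spoken_clock("hms12"))              # 12
print(spoken_clock("hm24"))               # 24
print(spoken_clock("hms12", detail="1"))  # 24
```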

The following example is spoken as “Two-thirty P.M.”:

[speaker-say-as interpret-as="time" format="hms12"] 2:30pm [/speaker-say-as]
This example is spoken as “Two-thirty P.M.”

Say as telephone

The only limit to the range of characters that can occur within the content and be appropriately interpreted is that imposed by the synthesis processor itself. Some characters that might commonly occur, in addition to the digits 0-9, are separator characters to give a structure to the number itself, a prefix ‘+’, letters that stand for numbers (“1-800-EXAMPLE”), and the characters ‘*’ and ‘#’; of course, these characters are by no means the complete set of characters that may occur.

The optional format attribute can be used to indicate a country code. Values are strings of digits; see [ITU-CC] for a normative list of country codes defined by ITU-T.

[speaker-say-as interpret-as="telephone" format="1"] (781) 771-7777 [/speaker-say-as]
This example is spoken as “Plus 1 781 771 7777”

This is a telephone number in use in North America.

[speaker-say-as interpret-as="telephone" format="1"] 1-866-TELLME-1 [/speaker-say-as]
This example is spoken as “Plus 1 866 8355631”

This is another telephone number in use in North America.
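The way letters stand for numbers in “1-866-TELLME-1” follows the standard telephone keypad. The Python sketch below shows that mapping; phone_digits is a hypothetical helper, and dropping every separator character is a simplification for illustration, not a description of what the synthesis processor does.

```python
# Standard telephone keypad: each digit and the letters printed on its key.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def phone_digits(text):
    """Reduce a written phone number to its digits, mapping letters via the keypad."""
    digits = []
    for ch in text.upper():
        if ch in "0123456789":
            digits.append(ch)
        elif ch in LETTER_TO_DIGIT:
            digits.append(LETTER_TO_DIGIT[ch])
        # Separators such as "-", ".", "+", "(", ")" and spaces are dropped.
    return "".join(digits)


print(phone_digits("1-866-TELLME-1"))  # 18668355631
print(phone_digits("(781) 771-7777"))  # 7817717777
```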

[speaker-say-as interpret-as="telephone" format="1"] +39.800.123456 [/speaker-say-as]
This example is spoken as “Plus 39 800 123456”

This telephone number is in country code “39” (Italy), even though the country code given in the format attribute does not match it.

[speaker-say-as interpret-as="telephone" format="81"] 0532441234 [/speaker-say-as]
This example is spoken as “Plus 81 532441234”

This is a local telephone number in use in Japan (country code “81”). Read more telephone number use cases in the W3C SSML specification.

HTML tag attribute

Starting from version 3.4.0, Say as can be used as an HTML tag attribute with the parameters listed above.

Example of Say as used as a tag attribute:

<span speaker-say-as="ordinal" format="yyyymmdd" detail="1"> 1 </span>

Substitution

Use the speaker-sub shortcode to indicate that the text in the alias attribute value replaces the contained text for pronunciation.

The following example is spoken as “World Wide Web Consortium” instead of “W3C”:

[speaker-sub alias="World Wide Web Consortium"] W3C [/speaker-sub]
This example is spoken as “World Wide Web Consortium”

You can also use the speaker-sub shortcode to provide a simplified pronunciation of a difficult-to-read word.

HTML tag attribute

Starting from version 3.4.0, Substitution can be used as an HTML tag attribute with the parameters listed above.

Example of Substitution used as a tag attribute:

<span speaker-sub="World Wide Web Consortium"> W3C </span>

Emphasis

Use the [speaker-emphasis] shortcode to add or remove emphasis from the text contained by the element.

Notes

The speaker-emphasis shortcode should only be used around a full sentence. Enclosing words within a sentence may cause unwanted pauses in speech.

The following example uses the speaker-emphasis shortcode to make an announcement:

[speaker-emphasis level="strong"] This is an important announcement [/speaker-emphasis]
This example is spoken as “This is an important announcement” with strong emphasis

This shortcode supports an optional “level” attribute with the following valid values:

  • strong
  • moderate
  • none
  • reduced

HTML tag attribute

Starting from version 3.4.0, Emphasis can be used as an HTML tag attribute with the parameters listed above.

Example of Emphasis used as a tag attribute:

<span speaker-emphasis="strong"> This is an important announcement </span>

Sentence and paragraph

Speaker supports sentence and paragraph markup as part of its default SSML handling. The plugin automatically adds intonation and short pauses at the end of sentences and paragraphs. For this to work, each sentence must end with one of the sentence-ending punctuation marks.

Use one of the punctuation marks to complete the sentence:

  • .
  • !
  • ?

Audio file URL

Use the [speaker-file] shortcode to display the URL of the audio file for the current post or page.

Prosody (from version 3.0)

Use the [speaker-prosody] shortcode to customize the pitch, speaking rate, and volume of text contained by the element.

[speaker-prosody rate="slow" pitch="-2st"] Can you hear me now? [/speaker-prosody]

The following shortcode attributes are supported:

  • rate – a change in the speaking rate for the contained text. Legal values are: a non-negative percentage or “x-slow”, “slow”, “medium”, “fast”, “x-fast”, or “default”. Labels “x-slow” through “x-fast” represent a sequence of monotonically non-decreasing speaking rates. When the value is a non-negative percentage it acts as a multiplier of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
  • pitch – the baseline pitch for the contained text. Although the exact meaning of “baseline pitch” will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by “Hz”, a relative change or “x-low”, “low”, “medium”, “high”, “x-high”, or “default”. Labels “x-low” through “x-high” represent a sequence of monotonically non-decreasing pitch levels.
  • volume – the volume for the contained text. Legal values are: a number preceded by “+” or “-” and immediately followed by “dB”; or “silent”, “x-soft”, “soft”, “medium”, “loud”, “x-loud”, or “default”. The default is +0.0dB. Specifying a value of “silent” amounts to specifying minus infinity decibels (dB). Labels “silent” through “x-loud” represent a sequence of monotonically non-decreasing volume levels.
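The percentage rule for the rate attribute can be illustrated with a short Python sketch. The name rate_multiplier is hypothetical, and the code only restates the arithmetic described above (100% means no change, 200% means twice the default rate); it says nothing about how a particular voice realizes that rate.

```python
# Named rate labels from the list above, slowest to fastest.
RATE_LABELS = ("x-slow", "slow", "medium", "fast", "x-fast")


def rate_multiplier(value):
    """Convert a percentage rate such as "200%" into a multiplier of the default rate."""
    if not value.endswith("%"):
        raise ValueError(f"not a percentage: {value!r}")
    percent = float(value[:-1])
    if percent < 0:
        raise ValueError("the rate percentage must be non-negative")
    return percent / 100.0


print(rate_multiplier("100%"))  # 1.0 (no change)
print(rate_multiplier("200%"))  # 2.0 (twice the default rate)
print(rate_multiplier("50%"))   # 0.5 (half the default rate)
```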

HTML tag attribute

Starting from version 3.4.0, Prosody can be used as an HTML tag attribute with the parameters listed above.

Example of Prosody used as a tag attribute:

<span speaker-prosody="" rate="slow" pitch="-2st"> Can you hear me now? </span>

Say hidden text (from version 3.0)

Use the [speaker-say] shortcode to voice text without displaying it on the front end of the page. This means that the text is displayed in the page editor and voiced by Speaker, but hidden from users.

[speaker-say] This text is converted to audio but not displayed to users [/speaker-say]

About compatibility

You can combine several Speaker shortcodes in one place on the page, as well as use multiple shortcodes on the same page.
You can also use most native SSML tags inside the markup of your pages. Valid markup will be converted to speech by Speaker.
