Designing better voice experiences

July 16th, 2020 – Written by Ben Childs

Designers of interactive digital services typically use tools such as collaborative whiteboards for concepting, visual design software for creating interfaces and prototypes for defining interaction. But what approach and methods are available to design the user experience of a digital service delivered not as visual pixels but as an interactive voice?

Note: This article uses examples and techniques learned during the recent design of an automated voice service implemented using Amazon Polly, but the described approach and use of SSML markup would also be applicable when using other speech synthesis engines.

Designing content for voice synthesis engines

The process of artificially producing or simulating human speech is referred to as speech synthesis. Computerised speech synthesis has been around since the 1950s and software developers have implemented it across multiple applications including video games, mobile services, ebook readers and public announcement systems. Until recently, the produced voice experience often sounded fairly robotic and monotone, but the latest generation of speech synthesis engines such as Google Text-to-Speech or Amazon Polly utilise artificial intelligence and big data to learn and mimic true voice patterns, allowing for much more natural and ultimately more human speech. Text is encoded for speech using the Speech Synthesis Markup Language (SSML), an XML-based language where text is wrapped in tags that instruct the speech synthesis engine how to generate sounds, where to pause or where to add emphasis.
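By way of a minimal sketch of how this markup looks in practice (the sentence and attribute values here are invented for illustration):

```xml
<speak>
    Welcome back.
    <break time="500ms"/>
    You have <emphasis level="moderate">three</emphasis> new messages.
</speak>
```

The engine speaks the plain text, pauses for half a second at the `<break/>` tag and adds stress to the emphasised word.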

Designing the experience of an automated voice service doesn’t require the scope or depth of design of a more visual interface, but the UX designer must still consider aspects such as the end-to-end customer journey, consistency of voice menu interactions and best practices for data entry using a phone keypad. Additionally, while the latest speech synthesis engines are already very effective ‘out of the box’ and can be implemented by a software engineer, maximising the available features and creating the best possible voice experience requires an iterative approach and a combination of content design and UX design.

Before considering how to design the final experience, it is first helpful to deconstruct audio communication into three parts – language, voice and speech:

  • Language is perhaps the easiest to understand – the spoken or written form of human communication that enables the transfer of knowledge and expression through mutually understood systematic conventions. In the context of an automated voice service, language plays a role in setting the tone and style of the communication – which from a digital services perspective is essentially the ‘content’.
  • Voice can be regarded as the physical process in animals of using the lungs and vocal tract to produce sounds. Whilst the voice can be used to generate speech, it is helpful to recognise it is also used to generate growls in animals, noises in infants or laughter, song and crying in adults. Attributes of a voice include how high or low it is in frequency (pitch) and how loud it is (volume).
  • Speech can to some extent be considered the bridge between language and voice, though the scope of the output is broader and more subtle than words alone, encompassing other expressions and intonation. Like voice, speech is also a physical process, albeit exclusive to humans, that is created using intricate muscle control in the head, neck, chest and abdomen. Attributes of speech include pronunciation, intonation and accents.

Therefore, from a more typical design perspective, the language aspect of an automated voice service can be considered as the content, informed where available by brand assets such as ‘tone of voice’ guidelines. The voice and speech aspects are effectively the interface, where the execution of the content is controlled by marking up the text using SSML tags. However, to produce the most natural voice experience requires an iterative approach that utilises both content design and interaction design expertise.

Customising the voice experience using Amazon Polly

Recognising that the gender, character and nationality of a voice can significantly affect the experience, most speech synthesis engines provide multiple voices to choose from. Amazon Polly uses two underlying technologies, with neural voices being slightly higher quality and more natural sounding than the standard voices. There are over 50 voices, three of which have a UK English accent (each using the preferred neural technology). Whilst there are minor differences in features between the neural and standard voices, the same approach and markup can mostly be applied to any of the voices. However, because each voice pronounces content slightly differently and at a slightly different speed, it is better to experiment with voices as early as possible and select the final voice before proceeding with any granular customisation.

Amazon provide detailed documentation for the SSML tags supported by Amazon Polly. Most of the tags affect how a sequence of words is delivered rather than controlling the actual creation of a sound. For this reason, when considering how to use the SSML tags it can be helpful to describe the process as curating rather than creating the automated voice – mostly using the tags to direct the voice rather than trying to manipulate every individual sound. All of the SSML tags supported by Amazon Polly are useful, but the following tags provide the biggest impact when trying to create a great automated voice experience:

  • <speak> Documentation

    All SSML content must be wrapped in the <speak> tag at the outer level.
  • <amazon:auto-breaths> Documentation

    Whilst it is possible to describe every breath manually using the <amazon:breath> tag, for most text with appropriate and well-structured punctuation, just enabling the auto-breaths feature for a whole text will greatly improve the impact.
  • <p> Documentation

    Text should be structured into related sections using the paragraph tag (as used in HTML for webpages). When combined with the auto-breaths feature, it is usually most effective to keep to very short paragraphs – sometimes only one sentence – as this enables Amazon Polly to add more natural sounding pauses and breaths.
  • <break> Documentation

    Although the auto-breaths feature is highly effective, there are frequently times when a specific pause is required to add clarity or provide time for user input. Like many of the SSML tags, the <break> tag can be customised using a preset attribute such as strength="medium" or with a value attribute such as time="500ms".
  • <emphasis> Documentation

    The emphasis tag provides a simple means to add prominence to a single word or sequence of words, offering three levels of emphasis applied using a preset attribute such as level="strong". However, in practice, the strong level can cause Amazon Polly to sound less natural so the <prosody> tag typically provides greater control and a better outcome.
  • <prosody> Documentation

    Prosody covers the aspects of speech that don’t involve the pronunciation of individual words or syllables, such as intonation, rhythm and stress. The <prosody> SSML tag provides three attributes affecting volume, speaking rate and pitch, each supporting both presets and specific values. The attributes can be used separately but are more frequently used in conjunction with each other. For instance, an increased pitch often sounds more natural if spoken slightly faster, and could be achieved using both the pitch="high" and rate="fast" attributes.
  • <say-as> Documentation

    Amazon Polly handles the pronunciation of most everyday language highly effectively, but adding extra context helps produce more effective speech where the text represents a specific format such as a date or currency value. Furthermore, due to Amazon Polly’s localised voices, understanding the format ensures the most appropriate outcome for that locale. For instance, by adding date context to the text ‘31/03’ using the attribute interpret-as="date", Amazon Polly will say “31st March” in UK English or “March 31st” in American English. Similarly, whilst Amazon Polly is trained to recognise common acronyms and speak them as separate letters, it is more reliable and sometimes essential to provide context that a word is an acronym using the attribute interpret-as="characters".
  • <sub> Documentation

    Although used less frequently than <say-as>, for specific subject areas that frequently use coded terms, such as science or education, more natural results can be achieved by explicitly telling Amazon Polly how to refer to a phrase. As readers, we might have suitable context to understand that when presented with the text “au” it could represent the country Australia or the chemical symbol for gold. To ensure Amazon Polly achieves the correct outcome, context can be added using the attribute alias="gold" whilst leaving the original text unchanged. For dynamically generated text relating to a specific subject or domain, this could be achieved programmatically by using a ‘lookup’ dictionary to ensure coded terms are always contextualised with an alias.
  • <phoneme> Documentation

    Although rarely required for most contexts and content, when Amazon Polly doesn’t pronounce a word properly the <phoneme> tag provides a means of describing how to make the correct sounds. Every individual phonetic sound is described in much the same way as a pronunciation is described in a dictionary or foreign language book, which Amazon Polly then concatenates to pronounce the word sound. In practice, the outcome can be less than perfect but the ability to construct a word sound manually is essential for the rare scenarios where the default pronunciation of a word fails.

There are further tags available for curating Amazon Polly speech, but the above represent the core building blocks required to enhance the effectiveness of most text.
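As a sketch of the tags above working together, the fragment below combines them in a single message; the content, attribute values and date format are illustrative assumptions rather than taken from a real service:

```xml
<speak>
    <amazon:auto-breaths>
        <p>Hello, and thanks for calling.</p>
        <p>
            Your next payment is due on
            <say-as interpret-as="date" format="dm">31/03</say-as>.
        </p>
        <break time="500ms"/>
        <p>
            To hear your <sub alias="personal identification number">PIN</sub>,
            press <emphasis level="moderate">one</emphasis>.
        </p>
        <p>
            Your reference is <say-as interpret-as="characters">AU12</say-as>,
            spoken <prosody rate="90%" volume="+2dB">slightly slower and louder</prosody>.
        </p>
        <p>
            Thank you for using <phoneme alphabet="ipa" ph="ˈpɒli">Polly</phoneme>.
        </p>
    </amazon:auto-breaths>
</speak>
```

Note how the short paragraphs give the auto-breaths feature natural points to breathe, while <say-as>, <sub> and <phoneme> each add context that the engine cannot infer from the raw text alone.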

Understanding the communication of speech

Using the above methods, many Amazon Polly implementations will be sufficiently improved just by adding breaths, pauses, pronunciation assistance and emphasis (preferably using prosody). However, the intricacies of natural human speech are further complicated by less tangible qualities such as accent, subtle stresses, colloquial intonation and rhythm. The AI behind Amazon Polly already conveys some of these qualities, and it is not always possible to manipulate them directly, but by further deconstructing patterns of speech it is possible to achieve even more natural results. These qualities may differ by language, region or the tone of the content, but attempting to establish guidelines for conveying them can provide better consistency and a more natural quality to the synthesised speech.

  • Sentence rhythm
    Amazon Polly delivers speech in a more intelligent and less robotic style than older speech engines, but inevitably doesn’t have the natural flow of real human communication. One subtle rhythmic pattern of natural speech is a slight quickening of the phrases that bridge between sentences – often heard at the start of a sentence, as the speaker segues from the previous one before anybody can interrupt! There is no specific SSML tag to handle this rhythm, but it can be curated by wrapping the first few words of a sentence in the <prosody> tag and increasing the rate by 10–20%.
  • Key phrase pronunciation
    Amazon Polly rarely fails to pronounce a word accurately, but occasionally even a very slightly different intonation stands out as ‘unnatural’. This can be most evident in terms which have a specific meaning or usage, such as brand and product names or topical phrases like ‘coronavirus outbreak’. Furthermore, due to well-intentioned intelligence in the speech engine, Amazon Polly sometimes intonates the same phrase slightly differently if it’s mentioned in quick succession, imitating the deliberate way we apply stress in natural speech but with undesirable effects if the phrase has a fixed pronunciation. Examples of this when designing the thinkmoney unified customer service system were references to the (Apple) “App Store” and, crucially, to the “thinkmoney” brand name itself. To ensure consistent pronunciation of these phrases, the <prosody> tag can be used again to very slightly adjust the generated speech. For instance, whenever Amazon Polly needs to speak the thinkmoney brand name, increasing the pitch and volume attributes very slightly makes the result more consistent and identifiable.
  • Regional specificity
    Much like in visual digital experiences, small details such as date formats or greeting styles that don’t recognise the context of the user can detract disproportionately from the overall experience. As demonstrated above, the <say-as> SSML tag allows for better curation of specific and often localised pronunciation. Where text and SSML tags are generated dynamically, content should ideally be localised using this approach to maximise the natural effect of the language for the context of the user.
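A brief sketch of these three techniques in SSML; the paragraphs below demonstrate, in order, sentence rhythm, key phrase pronunciation and regional specificity, and the specific percentage and decibel adjustments are illustrative assumptions that would need tuning by ear against the chosen voice:

```xml
<speak>
    <p>
        <prosody rate="115%">As mentioned earlier,</prosody>
        your account is up to date.
    </p>
    <p>
        Thank you for calling
        <prosody pitch="+3%" volume="+1dB">thinkmoney</prosody>.
    </p>
    <p>
        Your statement is dated
        <say-as interpret-as="date" format="dm">31/03</say-as>.
    </p>
</speak>
```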

A design process for creating better automated voice experiences

Using the approaches described above, the Amazon Polly automated voice experience can be iteratively improved to provide the most natural possible experience. Use the Amazon Polly console to test and iterate the voice in real time (an AWS account is required).

  1. Where possible, create specific text using a focussed content design approach.
  2. Choose an Amazon Polly voice that is appropriate (usually regionally) to the user.
  3. Implement basic SSML tags to create structure (using paragraphs) and natural breathing (using auto-breaths).
  4. Curate the speech more explicitly by iteratively applying SSML tags to enhance the phrasing and pronunciation of the text.
  5. Maximise the natural effect of the speech by applying SSML tags to affect the rhythm, key phrases and regional specificity of the text.

By way of demonstration, let’s consider the context of a simple introductory message to a voicemail system – requiring a personal tone, a greeting, a dynamic summary of the current inbox status and the actions available. Iteratively improving this message using the process above produces a markedly more natural result.
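A before-and-after sketch of such a message (the inbox contents, dates and key-press options are invented for illustration):

```xml
<!-- Before: plain text, spoken flatly -->
<speak>
    Hello. You have 3 new messages. Press 1 to listen or press 2 to delete.
</speak>

<!-- After: structured and curated with SSML -->
<speak>
    <amazon:auto-breaths>
        <p><prosody rate="110%">Hi,</prosody> welcome back.</p>
        <p>
            You have <emphasis level="moderate">three</emphasis> new messages,
            the latest from <say-as interpret-as="date" format="dm">16/07</say-as>.
        </p>
        <break strength="medium"/>
        <p>Press 1 to listen to your messages.</p>
        <p>Press 2 to delete your messages.</p>
    </amazon:auto-breaths>
</speak>
```

The ‘after’ version adds structure for natural breathing, emphasis on the key number, localised date handling and a deliberate pause before the menu options to give the listener time to prepare for input.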

Simple steps to create better voice experiences

Modern speech synthesis engines such as Amazon Polly deliver a powerful and effective automated voice experience out of the box, and can be enhanced further with only a few simple methods. However, voice is an ‘interface’ that we start using as humans in our early years and typically grow to use at a highly advanced and complex level, so every minor improvement and attention to minuscule detail can significantly improve the user experience. Speech synthesis engines are currently still imitative, synthetic reproductions of our primary means of communication, but with relatively little effort designers can use the methods described above to iterate the content and voice customisation and make the voice experience more natural. For developers, understanding these methods could also help with scripted dynamic content, where many of them could be built into rulesets that curate the text programmatically.