Pronunciation Correction Competition Results
August 30, 2008 at 4:20 pm | Posted in text to speech, Text2Go | 2 Comments
During the recent beta for Text2Go 3.0 we ran a competition to see who could submit the most corrections. The winner was Brad Isaac, author of the popular goal setting blog Persistence Unlimited. Brad has been sending in a steady stream of corrections since the beta went live. Thanks very much for your contributions. Brad received a set of Sennheiser Headphones for his efforts.
Today I thought I’d take a look at the pronunciation correction dictionaries to see how many new entries have been added since the official launch of Text2Go 3.0.
Here are the statistics.
108 new corrections
4 new text cleanup rules
1180 new words identified as correctly pronounced
Not bad, considering Text2Go 3.0 has only been released for 3 weeks. Thanks once again to all those who have contributed corrections, especially Brad!
Text2Go 3.0 Released – Pronunciation correction done right!
August 7, 2008 at 9:19 pm | Posted in text to speech, Text2Go | Leave a comment
Finally! It’s taken a lot longer than I expected. Software estimation proves once again to be an elusive art. The major new feature can be summed up as ‘Pronunciation correction done right’. Ever since I discovered text to speech technology I’ve been bugged by mispronunciations. Although quite rare, they tend to stand out in a document that’s being narrated. They’re especially grating if they occur multiple times in the same document. For this reason, most text to speech applications provide a way to enter corrections. The previous release of Text2Go provided this ability but it required the user to edit XML files and restart Text2Go each time. Not very user-friendly! It was a stop-gap solution until I could find the time to implement a proper solution.
That time has come. When I first designed Text2Go I had a lot of ideas on how to efficiently identify and correct mispronunciations. With this release I’ve been able to put these ideas into practice. This has been very satisfying.
One of the first challenges is finding a way of efficiently identifying mispronunciations. Pronunciation errors are actually quite rare. The naive approach is to listen to a document from start to finish, noting down any mispronunciations as you go. You can then come back and enter corrections for the next time the offending words are encountered. There are a couple of major problems with this approach.
The first is that you end up listening to the entire document, complete with mispronunciations. You’ll only get the benefit of the corrections you’ve entered the next time these words occur.
The second problem is the approach is incredibly inefficient. All documents are filled with high frequency words such as ‘a, is, the, and, in’ etc. These are never mispronounced but you have to listen to them over and over.
I wanted an approach that could identify and correct mispronunciations before listening to a document and was quick and efficient. So I came up with the following.
First, extract a list of words from the document and remove all duplicates. This single step means you only have to listen to a word at most once, no matter how many times it appears in the document.
Taking this one step further, once you’ve listened to a word and verified it to be correctly pronounced, it would be nice to be able to remember this so that you never have to check it again. This is particularly useful for eliminating the high frequency words mentioned above. Therefore Text2Go maintains a ‘white-list’ of correctly pronounced words. These are filtered from the document being checked, again significantly reducing the number of words requiring checking.
Of the remaining words, it would be nice to be able to identify the most likely to be mispronounced. The approach I’ve chosen is to spell-check the remaining words. Misspelt (or unrecognized) words are then placed on the top of the list. The reason is that brand names, jargon and slang that haven’t made it into the dictionary are more likely to be mispronounced. Of course correctly spelt words can also be mispronounced and unrecognized words correctly pronounced. It’s just a way of increasing the likelihood of identifying mispronunciations.
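The three filtering steps described so far (de-duplicate, drop white-listed words, float unrecognized words to the top) can be sketched in a few lines of Python. This is only an illustration, not Text2Go's actual code: the function name `candidate_words`, the word-splitting regular expression and the use of a plain set as a stand-in for a spell-checker dictionary are all assumptions.

```python
import re

def candidate_words(text, whitelist, dictionary):
    """Return the unique words still needing a check, unrecognized words first.

    `whitelist` and `dictionary` are plain sets of lower-case words; a real
    spell checker would replace the simple `in dictionary` test (assumption).
    """
    # Extract words, lower-cased, then de-duplicate keeping first-seen order.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    unique = list(dict.fromkeys(words))
    # Drop words already verified as correctly pronounced.
    remaining = [w for w in unique if w not in whitelist]
    # Stable sort: words missing from the dictionary come first.
    return sorted(remaining, key=lambda w: w in dictionary)

sample = "The biopic is the talk of the town. Signup is free."
whitelist = {"the", "is", "of"}
dictionary = {"the", "is", "of", "talk", "town", "free"}
print(candidate_words(sample, whitelist, dictionary))
# ['biopic', 'signup', 'talk', 'town', 'free']
```

Eleven words in the sample shrink to five to listen to, with the two unrecognized ones at the head of the list.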
Another strategy is to identify compound words (i.e. two words run together) as I’ve discovered these are more likely to be mispronounced. The way I identify compound words is to find all words that are made up of exactly two correctly spelt words. Unfortunately this generates a number of false positives (e.g. ration = rat + ion). It’s still a useful strategy but I could make it more effective if I could find a better way of identifying compound words.
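A minimal Python sketch of the two-word test shows both how it works and why it produces false positives. The function name `split_compound` and the tiny sample dictionary are made up for illustration; a real check would use the spell checker's word list.

```python
def split_compound(word, dictionary):
    """Return every (left, right) split where both halves are dictionary words."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in dictionary and word[i:] in dictionary
    ]

# Tiny stand-in dictionary (assumption, for illustration only).
dictionary = {"screen", "shot", "sales", "force", "rat", "ion", "ration"}

print(split_compound("screenshot", dictionary))  # [('screen', 'shot')]
print(split_compound("salesforce", dictionary))  # [('sales', 'force')]
print(split_compound("ration", dictionary))      # [('rat', 'ion')]  false positive
```

The test has no way to know that "ration" is a word in its own right rather than two words run together, which is exactly the weakness described above.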
Once you have a list of words you wish to check, Text2Go will speak each word in turn. If you do nothing, the word will be marked as correct. These words can then be added to the ‘white-list’ so they need never be checked again.
If you hear a word that is mispronounced, you can mark it as such with a click of the mouse. Once all words have been spoken, each will be either marked as correct or incorrect. Now all you need to do is enter corrections for each of the mispronounced words. These will then be added to the pronunciation dictionary.
This approach makes it very easy to check just a few words or a large list. You can watch a video of this in action here.
Once you’ve gone to the effort of identifying and correcting the pronunciation of a set of words or even if you’ve just verified a list of words, it would be nice if you could share this information with other Text2Go users. Others will gain the benefits of your corrections and you will gain the benefit of theirs. A win-win situation. This will result in a much larger pronunciation dictionary and in turn lead to more accurate text to speech.
To achieve this I wanted the sharing to require no extra effort on the part of the user. Therefore I’ve created an automatic-update like service that runs every couple of days. It runs completely in the background, requiring no interaction from the user. In fact you can continue to use Text2Go while it runs. First it downloads new pronunciation entries and white-listed words from the Text2Go web server. Then it uploads any corrections and white-listed words you entered locally. These are then merged and made ready for distribution in the next update.
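The post does not say how the merge resolves conflicts, so the following Python sketch is a guess at one reasonable policy rather than a description of the real service. The function name `merge_dictionaries`, the data layout, the rule that the later contribution wins and the rule that a corrected word leaves the white-list are all assumptions, and the sample entries are invented.

```python
def merge_dictionaries(server, contributed):
    """Merge one user's contributions into the shared dictionaries.

    Each argument is {"corrections": {word: replacement}, "whitelist": set}.
    Assumed policy: on a clash the contributed correction wins, and any word
    that has a correction is removed from the white-list.
    """
    corrections = {**server["corrections"], **contributed["corrections"]}
    whitelist = (server["whitelist"] | contributed["whitelist"]) - set(corrections)
    return {"corrections": corrections, "whitelist": whitelist}

# Invented sample data, for illustration only.
server = {"corrections": {"signup": "sign up"}, "whitelist": {"the", "biopic"}}
contributed = {"corrections": {"biopic": "bio pic"}, "whitelist": {"is"}}
merged = merge_dictionaries(server, contributed)
print(merged["corrections"])        # {'signup': 'sign up', 'biopic': 'bio pic'}
print(sorted(merged["whitelist"]))  # ['is', 'the']
```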
The other major area of functionality I’ve enhanced for this release is Text Cleanup Rules. A Text Cleanup Rule is a powerful search and replace operation (using regular expressions) that gets applied to a document before it’s converted to speech.
One example where Text Cleanup Rules can be useful is in identifying breaks in a document and inserting a pause. For example, a row of ******** or ————- is often used to denote a break in a document. By default these breaks would be pronounced as asterisk, asterisk, asterisk…. and minus, minus, minus… This very quickly becomes tiresome.
Text2Go includes a rule to identify these breaks and replace them with a pause. A single rule can handle both forms of break and will match two or more *’s or -’s, with or without spaces in between.
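As an illustration of what such a rule might look like, here is a Python regular expression that matches a row of two or more asterisks or two or more hyphens, with optional spaces between them. It is a sketch, not the rule shipped with Text2Go: the `[pause]` marker is a placeholder for whatever pause markup the speech engine expects, and anchoring the match to a whole line (so that a "--" in the middle of a sentence is left alone) is an assumption.

```python
import re

# Two or more *'s, or two or more -'s, optionally separated by spaces or tabs,
# occupying a line on their own.
BREAK_ROW = re.compile(
    r"^[ \t]*(?:\*(?:[ \t]*\*)+|-(?:[ \t]*-)+)[ \t]*$",
    re.MULTILINE,
)

def replace_breaks(text, pause="[pause]"):
    """Replace each break row with a pause marker (placeholder markup)."""
    return BREAK_ROW.sub(pause, text)

sample = "End of part one.\n* * * *\nPart two.\n--------\nPart three."
print(replace_breaks(sample))
```

A single `*` on a line, or a hyphenated word such as "well-known", is left untouched because neither forms a row of two or more symbols on its own line.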
In the previous version of Text2Go you could only create these by editing XML files. For this release I’ve added a built-in editor. The editor allows you to test your rule on a sample block of text as you edit it. Text Cleanup rules are also shared in the same way as pronunciation corrections. You can watch a video of the new editor in action here.
Finally I’ve added a few minor enhancements.
Clipboard Monitor. When you turn on the Clipboard Monitor, Text2Go will automatically add any text copied to the clipboard to the current document. Very convenient when converting text from PDFs, Word documents, email, etc.
Motor-Mouth. Works the same way as the Clipboard Monitor, except that instead of adding text to the current document, it speaks it aloud.
Status Display in the System Tray. In addition to displaying the current Text2Go status on the toolbar in Internet Explorer, it’s also displayed in the icon in the system tray (icon in the bottom right of the screen near the time).
Option to control Whether Text2Go is Started at PC Startup Time. By default Text2Go is started when you boot your PC, but for those who only use Text2Go occasionally, you may prefer not to have it started every time.
This release has been very satisfying to me personally. However I’m afraid that it may have been a little self-indulgent. To ensure this is not the case for the next release, I’m running a 10 Second Poll so you can vote on the next major feature you’d like to see added to Text2Go. Please take the time to vote.
You can download Text2Go 3.0 here.
Important Tip for RealSpeak Samantha, Serena and Tom Voice Users
July 30, 2008 at 9:48 pm | Posted in RealSpeak, text to speech, Text2Go, Uncategorized | Leave a comment
The other day I needed to splice some voice samples together for my post on RealSpeak Voice Pronunciation. I was using the free audio editing tool Audacity and happened to notice something disturbing about the waveform that had been generated. I was using the RealSpeak Samantha voice and it was quite clear that a certain amount of audio clipping had occurred.
You can see this in the regions I’ve highlighted in red, where the natural shape of the waveform looks to be cut off or clipped.
If we zoom right in so the individual waves are visible, you can clearly see that each peak has been chopped off.
Does this matter?
Yes. I’m no audio expert but we’re actually throwing away part of the signal and this will produce some audio distortion.
Can it be fixed?
Yes. The fix is as simple as adjusting the volume of the voice (don’t confuse this with the volume on your PC). You can adjust the volume of an individual voice using the Text2Go Options page. By default, the volume of all voices is set to Normal. By lowering this a couple of notches, the output for Samantha will no longer be clipped.
Converting the same text to speech produced the following waveform.
You can see that the waveform is no longer clipped at the top or the bottom.
Similarly, when we zoom in, each peak is nicely rounded and no longer chopped off.
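The effect can be reproduced with a synthetic signal. The Python sketch below assumes 16-bit audio and treats the voice volume setting as a simple gain factor; both are assumptions, since the post does not describe how Text2Go or RealSpeak scale the output. A sine wave whose peaks exceed full scale stands in for the over-loud voice.

```python
import math

FULL_SCALE = 32767   # largest positive 16-bit sample value
MIN_SAMPLE = -32768  # most negative 16-bit sample value

def to_pcm16(signal, gain=1.0):
    """Scale float samples by `gain` and clamp them into the 16-bit range."""
    return [
        max(MIN_SAMPLE, min(FULL_SCALE, round(s * gain * FULL_SCALE)))
        for s in signal
    ]

def clipped_fraction(samples):
    """Fraction of samples pinned at either 16-bit limit."""
    pinned = sum(1 for s in samples if s >= FULL_SCALE or s <= MIN_SAMPLE)
    return pinned / len(samples)

# Peaks reach 1.3x full scale, so the tops and bottoms get chopped off.
loud = [1.3 * math.sin(2 * math.pi * n / 100) for n in range(1000)]

print(clipped_fraction(to_pcm16(loud)))            # a large share is pinned
print(clipped_fraction(to_pcm16(loud, gain=0.7)))  # 0.0 once the gain is lowered
```

Lowering the gain before the samples are quantized keeps every peak inside the representable range, which is why turning the voice volume down helps and turning the PC volume down afterwards does not.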
Do other RealSpeak Voices suffer the same problem?
Serena and Tom also suffer some clipping, so if you use these voices make sure you adjust the volume setting down one or two notches. The other RealSpeak voices are not clipped at the Normal volume setting and don’t need to be adjusted.
Evaluating RealSpeak Voice Pronunciation
July 29, 2008 at 10:30 pm | Posted in RealSpeak, text to speech | 2 Comments
During the course of adding a pronunciation editor to Text2Go, I’ve discovered some of the strengths and weaknesses of the RealSpeak voices when it comes to pronunciation. Pronunciation errors are quite rare, making it hard to build up a large collection of mispronounced words. Text2Go’s new pronunciation editor makes this very easy.
Now that I’ve identified an extensive list of mispronounced words, it’s possible to spot some trends and discover which voice is the most accurate.
Firstly, I’ve found that compound words can cause problems (e.g. afterword, longterm, screenshot). Most common compound words are fine but often brand names that are made up of two words run together can be mispronounced. It’s very easy to correct these mispronunciations – you just separate the two words with a hyphen or space (e.g. after-word, long term, screen-shot). This occurs often enough that I’ve added a way to identify compound words in the pronunciation editor. I’ve found that Samantha is significantly better at pronouncing compound words than all the other RealSpeak voices.
A similar problem occurs with words having the prefix re-. For example reprogram, repurposed, rereleased. In these cases the re- is not identified as the re- prefix. Again the solution is simple, just add a hyphen after the re (e.g. re-program, re-purposed, re-released). Once again, Samantha does a better job of pronouncing re- prefixed words.
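Corrections of this kind amount to a whole-word substitution applied before the text reaches the voice. The Python sketch below uses the six examples from this post; the function name `apply_corrections` and the dictionary layout are illustrative assumptions, not Text2Go's pronunciation dictionary format.

```python
import re

# Word -> replacement pairs taken from the examples in this post.
CORRECTIONS = {
    "afterword": "after-word",
    "longterm": "long term",
    "screenshot": "screen-shot",
    "reprogram": "re-program",
    "repurposed": "re-purposed",
    "rereleased": "re-released",
}

def apply_corrections(text, corrections=CORRECTIONS):
    """Replace each listed word, matched as a whole word and ignoring case."""
    if not corrections:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, corrections)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: corrections[m.group(1).lower()], text)

print(apply_corrections("The rereleased screenshot tool is a longterm project."))
# The re-released screen-shot tool is a long term project.
```

Because the match is on whole words, "screenshots" would need its own entry, and a capitalized "Screenshot" comes out in lower case, which does not matter for speech.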
In order to hear the differences for yourself, I’ve chosen 10 mispronounced words and 4 voices. The table below contains each voice’s attempt to speak the word without any correction applied. Note – I’ve used a dash to indicate a passable but not perfect pronunciation.
| Word | Samantha | Karen | Daniel | Tom |
|---|---|---|---|---|
| adobe | [image] | [image] | [image] | [image] |
| anticlimactic | [image] | [image] | [image] | [image] |
| biopic | [image] | [image] | [image] | [image] |
| signup | [image] | [image] | [image] | [image] |
| martian | [image] | [image] | [image] | [image] |
| packrat | [image] | [image] | [image] | [image] |
| resell | [image] | [image] | [image] | [image] |
| salesforce | [image] | [image] | [image] | [image] |
| spokesperson | [image] | [image] | [image] | [image] |
| wildlife | [image] | [image] | [image] | [image] |
| Uncorrected | [image] | [image] | [image] | [image] |
| Corrected | [image] | [image] | [image] | [image] |
The Uncorrected row contains the voice’s uncorrected pronunciation attempt and the Corrected row contains the pronunciation after corrections have been applied. Notice once corrections have been applied, all voices pronounce all words correctly.
One set of results that surprised me were those for Tom. When I started writing up this post I was sure that Samantha was way ahead of the other voices. However this result shows that Tom is also a worthy contender. I’m still sure that Samantha has the most accurate pronunciation but the margin is not as great as I imagined.
My hunch is that Samantha is based on slightly newer technology and it’s the reason why the Samantha voice file is around 110MB in size whereas the others are around 70-90MB.
So does this mean that Samantha is the best voice and the one I should always use? What about regional differences?
Do voices from different regions pronounce words differently? Most definitely! Take the Australian voices Karen and Lee as examples. Not only do they have Australian accents, they correctly pronounce local Australian place names, whereas the other English voices can be way off. Listen to the following Australian place names (of Aboriginal origin) spoken first by Samantha (US English) and then Karen (Australian English).
Pronunciation is only one criterion on which to choose a voice. I believe it’s more important to choose a voice you enjoy listening to. If you like the sound of Samantha then definitely choose her but if you prefer the sound of one of the other voices, go with them. Remember that pronunciation errors in normal day to day text are quite rare for any of the RealSpeak voices.
I hate to admit it, but watching my child’s swimming lesson is becoming tedious.
March 20, 2008 at 6:55 pm | Posted in iPod, MP3 Player, text to speech, Text2Go | Leave a comment
My daughter currently has a swimming lesson once a week on a Tuesday evening. At first I was very excited, as I was able to knock off work a little early and be home in time to take her to her lesson. A few weeks in and I’m ashamed to admit it but it’s becoming a little tedious.
The problem is that it’s a group lesson with 3 other kids, so my daughter is only in action 25% of the time. It’s only a ½ hour lesson but then I let her have another ½ hour after the lesson to muck around in the pool.
Some of the other parents must also be finding it tough going. They come armed with an array of reading material. Books, magazines and newspapers are commonplace.
The thing is I really like to show my daughter support. If my head is buried in a newspaper when she looks up to see if I’m watching (which she does regularly – perhaps too often if truth be told), she’s going to be very disappointed. I’m also genuinely interested in watching her progress. I find it one of the most satisfying things as a parent, watching my kids develop over time.
Then it hit me. Text2Go is just perfect for this situation. I can listen to an eBook, article or collection of blog posts on my iPod while I keep my eyes on my daughter at all times. During the lesson I can offer encouragement and after the lesson I can just keep an eye on her so that she’s safe while playing in the pool.
This is definitely going to be the plan for next week’s lesson. The biggest problem I foresee is going to be the background noise level. The noise generated by 100 kids in an enclosed pool with concrete walls that bounce the sound back and forth is extreme. The standard iPod earbuds don’t actually block out a lot of outside noise. I know that Sennheiser make some iPod earbuds that include a set of earfit rings of varying sizes. These allow you to find the size that best fits your ear, creating a tighter seal between ear and earbud, and hence blocking out more background noise.
The next level up is to buy some earbuds or headphones that have active noise cancellation. These may be necessary in this situation. I’ll find out next week.
85% of Men Prefer Listening to a Female Voice
January 23, 2008 at 6:20 pm | Posted in RealSpeak, text to speech, Uncategorized | 5 Comments
I had a gut feeling, based on my own preferences that men prefer listening to female voices. I also assumed the opposite was true – women would prefer listening to male voices. Wrong!
To test these assumptions, I dug up some of my recent sales data for computerized voices. When you purchase Text2Go you can also choose to purchase one or more high quality RealSpeak voices. I looked at all purchases that included a single voice. I disregarded any purchase that included both male and female voices and any subsequent voice purchases. I also disregarded all sales of the Indian English female voice Sangeeta as there is no corresponding male Indian English voice.
It’s clear that men have a strong preference for the female voice and it’s what I expected. What is surprising is that women also prefer the female voice, albeit to a much lesser extent. In fact the statistics show women don’t have a strong preference one way or the other. My wife certainly prefers the male computerized voices, so I’d expected this to be the case for the general population.
Not wanting to let hard data get in the way of assumptions and gut feelings, here are a couple of reasons that may explain the unexpected results for the women.
1. There are characteristics of the female voice that make it easier to produce a more natural sounding computerized voice.
2. We currently sell 7 female voices but only 3 male voices. This extra choice may give some bias to female voices. It may also be an indication that the developers of computerized voices have recognised the popularity of the female voice. Note that the US, UK and Australian accents have both male and female voices.
What’s your preference? Have a listen to the computerized voices on the Text2Go website and let me know.
4 Quick Tips When Converting eBooks from Text to Speech
December 18, 2007 at 9:47 pm | Posted in eBook, text to speech, Text2Go, Uncategorized | 2 Comments
Today I purchased a new eBook ‘As the Mirror Cracks’ by Steve Jordan and I thought I’d share a few tips on converting eBooks from text to speech.
1. Check the DRM permissions. In a perfect world people would trust each other and all eBooks would be DRM free. Thankfully Steve Jordan publishes all his books in multiple formats, none of which have any DRM protection. However the majority of eBooks available for sale are DRM-protected and they will cause you a world of pain. DRM-protected works place all sorts of restrictions on how and where you can view your eBook. When converting an eBook to speech, the DRM protection must allow the text to speech operation. Check very carefully before purchasing the eBook that you are granted this right. If it’s not explicitly stated, assume text to speech has been disabled. Even if the eBook allows text to speech, it will only allow it to be performed from within the authorized eBook reader. If this runs on your PC, then you will only be able to listen to the eBook while sitting at your computer. To use a product such as Text2Go to convert an eBook to an MP3 file that you can listen to on the go, the eBook will need to grant you ‘Copy and Paste’ rights. Most don’t, so it’s best just to say no to DRM-protected works.
2. Don’t convert an eBook in one single chunk or you’ll end up with one enormous track. If you lose your place during playback, it will be very hard to find it again as you will need to seek through an enormous file. Instead I create a playlist for the eBook and then split it up chapter by chapter and store each chapter as a track within the playlist. If I lose my place during playback, it’s easy to find the chapter I was up to and then do a quick seek within the corresponding track.
3. Don’t convert an entire eBook upfront. Instead I convert and listen to the first couple of chapters. This allows me to quickly identify any problem areas during the text to speech process. These may be mispronounced words (most common when the eBook contains a lot of jargon, slang or terminology specific to a particular field), or formatting specific to the eBook (e.g. special characters used to denote pauses, or dividers between sections, chapters, etc). I can then add corrections for the mispronounced words to the pronunciation dictionaries and create text cleanup rules to handle the eBook’s specific formatting. With these in place I will convert the remaining chapters of the eBook.
4. Don’t use the free Microsoft voices. Listening to an entire eBook with one of these voices will not be a particularly pleasant experience. Instead purchase a high quality, natural-sounding voice.
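For tip 2, splitting a plain-text eBook into one chunk per chapter is easy to script. The Python sketch below assumes each chapter starts with a line beginning with the word "Chapter"; real eBooks vary, so the heading pattern would need adjusting per book. The function names and the track-naming scheme are made up for illustration.

```python
import re

# Assumed heading style: a line that starts with "Chapter" followed by a
# number or word, e.g. "Chapter 1" or "Chapter Two: The Storm".
CHAPTER_HEADING = re.compile(r"^chapter[ \t]+\w+.*$", re.IGNORECASE | re.MULTILINE)

def split_into_chapters(book_text):
    """Return (title, body) pairs, one per chapter heading found."""
    headings = list(CHAPTER_HEADING.finditer(book_text))
    chapters = []
    for i, match in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(book_text)
        chapters.append((match.group().strip(), book_text[match.end():end].strip()))
    return chapters

def track_names(chapters):
    """Number the tracks so they sort in reading order within a playlist."""
    return [f"{n:02d} - {title}" for n, (title, _) in enumerate(chapters, start=1)]

book = "Chapter 1\nIt began.\n\nChapter 2: The Storm\nIt ended.\n"
chapters = split_into_chapters(book)
print(track_names(chapters))
# ['01 - Chapter 1', '02 - Chapter 2: The Storm']
```

Any text before the first heading (title page, dedication) is skipped by this sketch, and converting only `chapters[:2]` first fits tip 3.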
That’s it. Do you have any tips of your own? Stay tuned for a review of ‘As the Mirror Cracks’.