Andrew Kozminski
First presented at
AVIOS SPEECH DEVELOPERS
CONFERENCE & EXPO
March 31st - April 3rd, 2003
February 2003
Disclaimer: This paper presents personal views and opinions of the author at this time, which are not binding on Pronexus and are subject to change.
Table of Contents
5 Conclusion
6 About Pronexus
7 Glossary
A POWERFUL VOICE IN COMMUNICATION SOLUTIONS
1 Executive Summary
Despite the great progress in speech technology over the past twenty years, speech applications are still not widely
used by carriers and enterprises. Although there are various reasons for the limited deployment of speech applications,
we believe that cost is the main barrier to wider adoption. The high cost of developing and deploying speech applications
stems from several factors, including design complexity, technology inaccuracy, licensing costs and post-installation
effort.
Building a voice-enabled application is a complex task since it involves the use of multiple core technologies from multiple
vendors. Application design is further complicated by the lower accuracy of speech recognition (compared to DTMF) and by
the lack of restriction on user input, both of which require additional programming logic dedicated to error compensation
and failure handling.
In addition to design complexities and technology inaccuracies, which require time and human resources, speech appli-
cations also necessitate monetary investments in expensive ASR and TTS licenses. Due to their complicated nature,
speech applications require attention and tuning even after they are deployed (adding and deleting subscribers to gram-
mars for example), increasing the cost of maintenance.
With the above in mind, is it possible to design a lower cost speech application? There are several guidelines that can
help developers to lower system costs and design for the best cost-functionality balance.
Since building a modern speech application requires integration of components provided by different vendors, the most
efficient way to create an application is to use a high-level Rapid Application Development (RAD) tool. RAD tools hide
low-level complexity and provide a uniform development environment for multi-vendor components. In addition, the use
of a development environment will help developers maintain vendor independence when it comes to speech resources
and telephony hardware.
Development and deployment costs can be reduced by using a modular architecture. Among other things, a modular
architecture allows for independent provisioning and software hot-swaps, eliminating costly downtime and providing
increased reliability, while enabling resource sharing. Using royalty-free engines, properly engineering the license
manager, and using floating licenses can lower the cost of ASR and TTS licenses.
These guidelines provide a framework for lowering the cost of speech-enabled applications. The remainder of this docu-
ment describes them in more detail.
2 Introduction
It's certainly no secret that the last few years have brought some dramatic changes to the high-tech landscape. The
harsh economic reality tested many business formulas, technologies and products. Unfortunately, many of the once
widely hyped ideas did not survive and all of us have learned an important lesson: at the end of the day, the only suc-
cessful products/solutions are those that make money. In other words, it's all about cost, price and, above all, Return on
Investment (ROI).
So, how can we build a compelling ROI case for speech technology? Of course the return depends on the business side
of your application/project. Because this is a technical paper, we'll only briefly touch on this aspect. But the investment is
mostly about technology: it's the cost of the building blocks that you use and the engineering and tuning hours that you
spend. We'll take a closer look at this aspect, because the technical decisions made early in a project can have a dramatic
impact on the final cost. No matter how much engineers hate this, more often than not, it is the cost that kills the great
product ideas.
With the above in mind, this paper will review different technologies, architectures and implementation strategies for
Throughout this paper we'll refer to the example of the Airport Assistant, a real life speech-enabled application imple-
mented by one of our partners. Airport Assistant allows callers to select by name one of several hundred airport servic-
es, request real-time flight information and be notified (at a provided cell-number) about any changes in the schedule.
Airport Assistant has been successfully deployed at the biggest airport in Canada - Toronto's Pearson International.
So why, despite almost twenty years of continuous improvements in technology, are speech applications still so slow to
sell? Most analysts agree that the technology itself is "ready for prime time" and that customers are ready to accept its
clear benefits. Where is the problem? Many different reasons have been cited, from numerous false starts of immature
technology, to bad design, to a cultural aversion to "talking to a machine". All of these are true, but in our opinion, a
major remaining barrier to wider adoption is cost. The remainder of this section discusses the main factors contributing
to the high cost of speech applications today.
Of course, components coming from different vendors are not designed to work easily together, which makes implementation
difficult and lengthens the learning curve. It also requires a team with a diverse and sophisticated skill set: telephony
hardware, protocols, real-time programming and Web development, to name a few. Even today, experienced developers
and voice system designers are not cheap and their salaries quickly add to the cost of a project.
As an example, the ASR licenses (no TTS was used) for the Toronto Airport Assistant application, described in the intro-
duction, originally accounted for almost 50% of the total system cost. The remaining 50% paid for everything else,
including redundant hardware, development tools, database server, UPS and more. The cost of ASR licenses was later
reduced to less than 23%, by better license management that took advantage of the specific call patterns. This approach
is discussed in more detail in the following sections.
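To see what this shift in share implies in absolute terms, the sketch below works through the arithmetic. It rests on two assumptions that the text does not state: the cost of everything other than ASR licenses stayed the same, and the 23% figure is the ASR share of the new, lower total. The function name and the normalization of the original total to 1.0 are illustrative only.

```python
def cost_after_share_change(old_share: float, new_share: float) -> tuple[float, float]:
    """Return (new_license_cost, new_total) relative to an original total of 1.0.

    Assumes the non-license cost is unchanged (an assumption, not a figure
    from the paper) and that new_share is measured against the new total.
    """
    other = 1.0 - old_share                 # everything except ASR licenses
    new_total = other / (1.0 - new_share)   # other costs now make up (1 - new_share)
    new_license = new_total - other
    return new_license, new_total


license_cost, total = cost_after_share_change(old_share=0.50, new_share=0.23)
license_saving = 1.0 - license_cost / 0.50
total_saving = 1.0 - total

print(f"ASR license spend falls by about {license_saving:.0%}")
print(f"Total system cost falls by about {total_saving:.0%}")
```

Under those assumptions, the ASR license spend drops by roughly 70% and the total system cost by roughly 35%. Because the share is quoted as "less than 23%", both figures would be lower bounds on the saving.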
With such high licensing costs, it is unfortunate that many commercial speech platforms seem to completely ignore the
issue and continue to use licenses very ineffectively. It is not uncommon that one application port could require two or
more ASR licenses, especially if multiple languages or "always active hot-words" are involved. Obviously, doubling or
tripling the licensing cost has a dramatic impact on the end user price of the finished application.
We've identified cost as one of the main barriers to the wider adoption of speech applications, in particular in mid-market
environments. High up-front costs result in marginal ROI stories, and it doesn't matter how elegant or efficient the appli-
cation is if no one buys it. Thus, it is important to design for the best cost-functionality balance.
Today, system architects and developers of speech-based telephony applications face many difficult choices regarding
platforms, tools, speech technologies, and so on. The abundance of new standards only adds to the overall confusion.
Of course, no single system architecture could meet all possible requirements, but at the same time, selecting the right
architecture has fundamental impact, especially on cost. This section offers a number of specific recommendations
based on our real-life experiences.
Typically, speech building blocks (cards or speech engines) come with native APIs, that is, low-level interfaces requiring
programming in C++. But writing a speech application in C++ is not for the faint-of-heart. It may sound like a fun chal-
lenge to developers, but certainly not to a project manager who is responsible for budgeting and scheduling. Low-level
details (such as call state machines, resource management, multi-threading, voice buffering, ActiveX, COM, and sock-
ets) will very quickly defocus developers from solving the actual business problem at hand. Learning APIs and telephony
abstraction models from different vendors will dramatically extend the learning curve. Don't count on the telephony standards
for help: unfortunately, despite multiple attempts (such as TAPI, SAPI, TSAPI, and S.100), there is no universal
standard API for low-level telephony functionality. Adoption by vendors is random at best, and interoperability of particu-
lar implementations can never be taken for granted.
Can this complexity of native-APIs be avoided? Absolutely - by leveraging the work done by others. In practice, this
means using one of the high-level Rapid Application Development (RAD) tools. RAD tools hide low-level complexity and
abstract the mishmash of multi-vendor components into a uniform development environment, focused on the business
logic, not technology. Some RAD tools reduce the learning curve even more by leveraging the power of one of the
industry standard Integrated Development Environments (IDE), for example Microsoft Visual Studio, and popular devel-
opment languages, such as Visual Basic or Java.
Properly selected development tools can save a lot of money. On the Toronto Airport Assistant project, both the develop-
ers and project manager claimed that the switch to an appropriate tool cut the delivery time by more than 50%.
Of course, not all RAD tools are created equal and they should be carefully selected for a specific project and its
development team. Ideally, you should look for a tool that combines a high-level visual design environment with a flexible
programming environment.
Visual Design - Most RAD tools use drag-and-drop GUI interfaces, which increase programmer productivity and
enhance structure and readability of the source code. However, when selecting a tool, consider inherent limitations of
visual design and programming. Ready-to-use building blocks work very well as long as all functionality is available.
Some applications may benefit from the simplicity of this approach; however, most practical applications require function-
ality that goes beyond ready-to-go functions. Sooner or later, you'll need to customize blocks or integrate them with
external third-party components. In other words, choose tools that combine the productivity gains of visual programming
with a powerful programming environment.
Programming Environment - Building speech applications today is all about integration and customization. Any serious
application requires some custom programming. The right development and debugging tools can save your project when
you least expect it. Make absolutely sure that your tools support a serious, industry-standard programming language,
source-level debugging, seamless invocation of component libraries (such as DCOM, ActiveX, and CORBA) and control
over multithreading. If you're building an application on Windows, don't miss out on the benefits of the next-generation
technology from Microsoft - your tool must support .NET!
As a typical example, one of our customers built a large outbound dialing application that depended on answering
machine detection. The initial implementation used the original detection algorithm embedded into Dialogic cards.
Unfortunately, statistical accuracy (which depends highly on the target calling area) couldn't be verified before field trials.
When the first results came in, accuracy levels were around 80%, significantly below expectations. Because their RAD
tool was fully integrated into Microsoft Visual Studio, the programmers were able to devise a custom solution that boosted
accuracy to 96%. This would have been impossible without a flexible programming environment.
A modular system is cheaper to build and maintain. In development, programmers benefit from working in parallel on
well-defined modules. In production, independent module provisioning and software hot-swaps eliminate costly system
downtimes. At the same time, separating application logic from telephony and speech processing allows resource shar-
ing, which in turn leads to more efficient utilization. Finally, distributing your modules across a LAN enables load balanc-
ing and effortless scalability - again resulting in savings on system maintenance.
The biggest benefit, however, comes from increased reliability of a modular system. Nothing is more frustrating to callers
than a system that crashes into "dead silence" in the middle of a transaction. An unreliable system will soon be pulled
out of production, which always means significant financial losses.
A monolithic executable is only as reliable as its weakest component, while a modular system can stay operational even
after losing one of its modules. Therefore, it is very important that application modules execute properly separated from
each other and from the system processes, so that a fatal error in one doesn't bring down the whole system. The mod-
ules should run out-of-process, or even better, distributed across a LAN. Ideally, modules should be compiled directly
into stand-alone executables, not into intermediate scripts or p-code. Not only does this speed up program execution, it
also removes the dependency on a shared runtime engine as a single point of failure.
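The isolation argument above can be illustrated with a minimal sketch in which each application module runs in its own operating-system process under a small supervisor. The module names and bodies are hypothetical stand-ins, and Python is used only to show the out-of-process boundary; it does not reflect the advice about compiling modules into stand-alone executables.

```python
import subprocess
import sys

# Hypothetical module bodies: one crashes, one completes normally.
CRASHING_MODULE = "import sys; sys.exit(3)"
HEALTHY_MODULE = "print('flight-info module handled call')"


def run_module(source: str) -> subprocess.CompletedProcess:
    """Run one application module in its own OS process."""
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=30,
    )


def supervise(modules: dict[str, str]) -> dict[str, bool]:
    """Run every module and report which ones exited cleanly.

    A non-zero exit code in one module is recorded, but the supervisor
    and the remaining modules keep running.
    """
    status = {}
    for name, source in modules.items():
        result = run_module(source)
        status[name] = result.returncode == 0
    return status


if __name__ == "__main__":
    print(supervise({"directory": CRASHING_MODULE, "flight_info": HEALTHY_MODULE}))
```

The fatal exit in the hypothetical "directory" module is contained by the process boundary, so the "flight_info" module still runs. A production supervisor would also restart failed modules and could place them on different machines across the LAN.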
Speech Resources - Selecting the right speech components for your application is very important. Speech technology is
still complex and very expensive, but the quality and accuracy of the chosen engines could ultimately make the difference
between success and failure of your project.
In practice, when it comes to speech processing, the "one size fits all" approach does not work. This is true for ASR, but
even more so for TTS - not all engines and languages are equal. Therefore, it pays to carefully research and evaluate
different vendors before deciding on a TTS product. Remember that perception of quality is subjective and may depend
on your audience and application. As a result you may end up working with more than one vendor at a time and chang-
ing vendors as your application evolves. Unfortunately, this may require re-doing the integration work many times, result-
ing in additional cost. If you're lucky, the engine of your choice may support a standard API (like SAPI), but not all do.
And even if it does support SAPI, these interfaces are often far from perfect. Fine variations in timing, buffering schemes
and performance can result in irritating gaps, clicks and delays. Usually, better results are achieved through individually
crafted, native APIs, but this again means additional development costs.
The best way to achieve vendor-independence is to use a middleware abstraction layer, which in turn works with a number
of alternative engines. Again, a RAD tool is appropriate: it will protect your investment in application development
should a shift in requirements, technology or vendor strategies necessitate the move to another engine. It will also allow
you to experiment with multiple speech products to find the best price-quality balance for your application.
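A middleware abstraction layer of this kind can be sketched as a vendor-neutral interface with one thin wrapper per engine. Everything named below (TtsEngine, VendorATts, VendorBTts, make_engine, announce_gate_change) is hypothetical and stands in for wrappers around real native APIs; the byte strings are placeholders for audio.

```python
from abc import ABC, abstractmethod


class TtsEngine(ABC):
    """Vendor-neutral interface the application codes against."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return audio for the given text."""


class VendorATts(TtsEngine):
    # Stand-in for a wrapper around one vendor's native API (hypothetical).
    def synthesize(self, text: str) -> bytes:
        return b"A:" + text.encode("utf-8")


class VendorBTts(TtsEngine):
    # Stand-in for a second vendor's wrapper (hypothetical).
    def synthesize(self, text: str) -> bytes:
        return b"B:" + text.encode("utf-8")


ENGINES = {"vendor_a": VendorATts, "vendor_b": VendorBTts}


def make_engine(name: str) -> TtsEngine:
    """Pick the engine from configuration; application code never names a vendor."""
    return ENGINES[name]()


def announce_gate_change(engine: TtsEngine, flight: str, gate: str) -> bytes:
    # Business logic depends only on the TtsEngine interface.
    return engine.synthesize(f"Flight {flight} now departs from gate {gate}")
```

Switching vendors then means changing the configured engine name, for example from make_engine("vendor_a") to make_engine("vendor_b"), while announce_gate_change stays untouched. The integration work is confined to the wrapper classes.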
Telephony Hardware - Similar to speech resources, vendor-independence of telephony hardware can save you money. It
is not uncommon for speech applications to run on more than one brand of hardware or to switch vendors for better pric-
ing. In general, the smaller telephony hardware suppliers tend to be less expensive and much more accommodating
when it comes to technical support plans, and if you're new to computer telephony you will definitely require support. On
the other hand, smaller vendors may not support all cards and protocols. The most popular ones, such as analog, T1/E1,
ISDN-PRI, H.323 and R2-CAS, are a must. However, it's also worth paying attention to the less obvious capabilities, such as
SIP, PBX set-emulator cards and transfer-on-CO protocols like TBCT, RLT and Q.SIG. Again, make sure that your mid-
dleware framework allows experimenting with different cards and protocols from multiple vendors, and keep in mind that
your requirements may change with time.
PBX Integration - Most analysts predict that call centers will account for the biggest slice of the "speech market pie" in
the coming years. If your organization is engaged in the call-center market, you know that developing applications for
just one PBX-brand is not enough. Given that speech is so much more expensive than touchtone, PBX vendor-inde-
pendence is becoming more important than ever. Again, make sure that your development tools or middleware offer a
good PBX integration story.
What does this mean in practical terms? Today, many PBX vendors are changing their traditional proprietary architec-
tures and have begun opening up to third-party applications. Since PBXs are built for the enterprise (with its strong
Microsoft presence), this trend is most visible in Windows, where in recent years we've observed a renewed interest in
TAPI as the integration technology of choice. As a result, you may expect a complete and well-tested TAPI Service
Provider for almost any switch, especially for modern IP-PBXs. In our opinion, building your speech application on TAPI
is the best strategy for widening your customer base and consequently maximizing your ROI in the call-center market.
Royalty Free Engines - Yes, they are available! Companies like Microsoft and Aculab offer license-free ASR and TTS
technologies of high quality that may be perfect for your application. One word of caution: customers may accept a lower
quality TTS (as long as the message is understandable), but they have much less tolerance for imperfect ASR. From our
experience, speech recognition has to work close to perfectly, or it will be deemed useless and dropped. In other words,
carefully evaluate your ASR alternatives.
As with the other elements of your application, maintaining vendor-independence works to your advantage, allowing you,
for example, to experiment with free engines before deciding to spend your dollars on licenses. (As a side note, none of
the speech vendors that we know of accepts returns of purchased licenses). Again, picking a middleware framework that
supports both free and commercial engines is, in our opinion, the best strategy.
One license per channel - If you decide to use a commercial speech engine from one of the industry leaders, invest
some time to properly engineer your license manager - a lot of money can be saved by this effort. There is no technical
reason to use more than one engine license (TTS or ASR) per application channel. Even systems using multiple lan-
guages or parallel grammars to implement hot-words can be designed to use one license per channel at any given time.
Make sure that your middleware doesn't force you to unnecessarily double your resources.
Floating licenses - You should also keep in mind that many applications don't require ASR and TTS for the whole dura-
tion of a call. As an example, consider a pre-paid calling-card system. It uses speech recognition to identify callers,
checks account balances and then bridges calls to outbound trunks. In this scenario, speech resources (ASR and TTS)
are only required for a small fraction of a call, possibly as low as 10%. Once a call is bridged, the resources can be
redeployed to serve other channels -- this presents a great savings opportunity. If licenses could float dynamically
between channels, in theory the savings could be as high as 90%. Unfortunately, not many platforms allow floating
speech resources, but some systems do. Given the savings, it pays to ensure that the tool you choose allows for proper
license management to take advantage of the specific calling patterns in your application.
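The floating-license idea can be sketched as a small pool that hands a license to a channel only for the speech portion of a call. This is a minimal illustration, not any vendor's license manager: the class FloatingLicensePool, its lease method and the handle_call flow are all hypothetical, and the pool sizes in the demo are arbitrary.

```python
import threading
from contextlib import contextmanager


class FloatingLicensePool:
    """Shares a fixed number of engine licenses among many channels."""

    def __init__(self, licenses: int) -> None:
        self._slots = threading.BoundedSemaphore(licenses)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0

    @contextmanager
    def lease(self, timeout: float = 5.0):
        """Hold one license for the duration of the with-block only."""
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no speech license available")
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
        try:
            yield
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()


def handle_call(pool: FloatingLicensePool, log: list, channel: int) -> None:
    # Speech is needed only while identifying the caller.
    with pool.lease():
        log.append((channel, "recognized caller"))
    # The license is already back in the pool while the call is bridged.
    log.append((channel, "bridged to outbound trunk"))


if __name__ == "__main__":
    pool = FloatingLicensePool(licenses=2)
    log: list = []
    threads = [threading.Thread(target=handle_call, args=(pool, log, ch)) for ch in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"10 channels served, peak licenses in use: {pool.peak}")
```

The semaphore guarantees that no more than the purchased number of licenses is ever in use, however many channels are active. In a real deployment, the pool size would be derived from the measured fraction of call time that actually needs speech, with headroom for peaks so that callers rarely hit the timeout.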
Tuning is much more than just tweaking grammars. It is an iterative process of analyzing system performance and
repeatedly applying the best design practices in order to arrive at the most satisfying user experience and in order to
work around technology imperfections. As a result, the tuning phase can take many months and requires an interdiscipli-
nary team of professionals, including not only developers and testers but also experts in linguistics and often psychology.
The resulting cost is substantial, but in our experience this is money well spent.
Planning for tuning is difficult, because speech systems, unlike touchtone, are highly dependent on the demographics,
local accents, language mix and even the culture of the target audience. Some applications, such as speech-enabled
auto-attendants, may require regular, on-going tuning as the grammar (i.e. a list of employee names) changes over time.
Typically, end users are not able, and should not be expected, to perform tuning themselves, so the application should
be specifically designed for on-going, remote maintenance by the vendor.
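As one illustration of designing for on-going maintenance, the name grammar of an auto-attendant can be regenerated from the employee directory instead of being edited by hand. The sketch below assumes the W3C SRGS XML grammar format, which the text does not name; the function build_name_grammar and the rule id "employee" are hypothetical, and a real grammar would also need nicknames, pronunciations and tuning.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_name_grammar(names: list[str], lang: str = "en-US") -> str:
    """Build an SRGS XML grammar with one alternative per employee name."""
    ET.register_namespace("", SRGS_NS)
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {
            "version": "1.0",
            "root": "employee",
            "mode": "voice",
            f"{{{XML_NS}}}lang": lang,
        },
    )
    rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule", {"id": "employee", "scope": "public"})
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    # Sort and de-duplicate so repeated runs over the same directory are stable.
    for name in sorted(set(names)):
        item = ET.SubElement(one_of, f"{{{SRGS_NS}}}item")
        item.text = name
    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    print(build_name_grammar(["Bob Lee", "Alice Wong", "Bob Lee"]))
```

Running such a step whenever the directory changes keeps additions and removals out of the hands of end users, although recognition accuracy on the regenerated grammar would still need to be checked by the vendor.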
Tuning large grammars, such as a city's phone book, tends to be particularly challenging and should be approached with
special caution. The experienced speech technology providers seem to be well aware of the possible problems: one
highly recognized vendor would sell us a ready-to-use grammar containing tens of thousands of names, but would not
venture into signing a contract to get it working in the field.
Unfortunately, saving on tuning is not easy and may jeopardize the final quality of your product. Some savings may be
achieved by employing off-the-shelf speech component libraries such as Nuance Speech Objects or SpeechWorks
Dialog Modules. But in general, tuning is not the area in which to be penny-pinching. In speech recognition applications,
users typically have very little patience for shaky technology. The system has to be almost perfect or it will not be used.
There is no middle ground.
Operating System - The choice of operating system fundamentally impacts many aspects of a speech application.
Today, there are three main choices: Unix (mainly Solaris), Linux or Windows. While discussing the merits of each OS is
beyond the scope of this paper, we will discuss some important considerations specific to telephony and speech.
First, keep your target market in mind. The old bias against Windows still holds strong in some traditional telephony envi-
ronments, especially among carriers in North America. Even recently, we've seen an already completed application
being ported to Solaris after approaching carriers with a Windows version. However, other regions of the world regard
Windows much more favorably. Even in North America, the situation is much different in the enterprise, where Windows
naturally fits into the desktop and business back-end dominated by Microsoft.
The most widely quoted complaints against Windows are reliability and price. We believe that this continued bias is no
longer justified - the modern Windows is reliable enough for speech applications, for both carriers and enterprises. As for
the price, Windows compares favorably to Unix, and while Linux is free, the price of the OS alone is often almost negligi-
ble, especially for systems deployed in small numbers. For example, in the case of the Toronto Airport Assistant (48
lines, Nuance), the price of the Windows operating system was not even 1% of the total cost.
Therefore, the price of the operating system is secondary to the availability of strong development tools, component
libraries and middleware. The resulting increase in developer productivity has the potential to far outweigh the savings
on the purchase price of the operating system.
Open Standards for Speech - Recent years have brought multiple exciting initiatives to standardize the development of
speech applications, including the well-known VoiceXML, SALT, X+V, and CCXML. A discussion of their respective tech-
nical merits is beyond the scope of this paper, but there is a wealth of relevant information available from many sources.
Similarly, we will not attempt to speculate which standard will ultimately prevail in the future. Instead, we will point out a
few less obvious aspects that may impact your immediate strategy today. We will focus on VoiceXML, as it is the only
standard in deployment today. The fundamental question is: will your project benefit from using the standard or would
you be better off with a proprietary system? Unfortunately, the answer is not always straightforward. Below we present a
few ideas to consider.
First, try to articulate the exact benefits of VoiceXML for your particular application. Next, look at the cost of your
respective choices. Request quotes for equivalent VoiceXML and proprietary platforms and analyze them carefully. You
will most likely find VoiceXML environments to be substantially more expensive. We received a quote for a typical
24-port VoiceXML Gateway (including hardware, but without ASR or TTS licenses) at $1,265 per port. You can build an
equivalent system using one of the popular proprietary RAD tools at half this cost.
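For scale, the quoted per-port price can be turned into platform totals. The sketch takes "half this cost" literally, which is an assumption; the text gives no exact quote for the proprietary alternative, and ASR and TTS licenses are excluded on both sides.

```python
PORTS = 24
VOICEXML_PRICE_PER_PORT = 1265   # quote cited in the text, USD

voicexml_total = PORTS * VOICEXML_PRICE_PER_PORT
# "Half this cost" is taken literally here; no exact RAD quote is given.
rad_total = voicexml_total / 2

print(f"VoiceXML gateway: ${voicexml_total:,}")
print(f"Proprietary RAD platform (assumed): ${rad_total:,.0f}")
print(f"Difference: ${voicexml_total - rad_total:,.0f}")
```

On these numbers the 24-port gateway comes to $30,360, so the assumed proprietary platform would cost about $15,180, a difference of the same amount before any speech licenses are added.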
Secondly, again consider the productivity of your programmers. VoiceXML, a complex language in itself, is not enough to
build an application. You still need scripts (CGI, Java, ASP, JavaScript or VBScript) to implement your business logic and
then you need to host it all on a web server. The resulting environment is complex and does not offer a ready-to-use
framework specific to telephony. Consequently, you'll need to think about multithreading your telephony channels, deal-
ing with call state machines and synchronizing access to the global data structures, just to name a few.
Debugging your application is another important consideration. We don't know of any VoiceXML Gateway that offers an
IDE supporting the complete environment. Even for VoiceXML alone, the development tools are in short supply.
Furthermore, most tools are merely GUI overlays on top of VoiceXML syntax -- good only for creating static pages (as
opposed to dynamic pages, which are generated on-the-fly from database queries and program logic). Therefore, before
committing to a VoiceXML gateway, ask about source-level debugging, handling of call state machines, multithreading of
application code, accessing databases and other basic programming tasks. Our point is that today's VoiceXML develop-
ment environment is still primitive when compared to the industry-standard IDEs, like Microsoft Visual Studio.
Finally, take a look at the planned functionality of your application. VoiceXML, by definition, is limited by its own
specification. VoiceXML has been designed specifically for speech-based user dialogs and that's where it excels. If your
application is about call control, then beware: your only hope will be proprietary extensions, which in turn tie you to a
specific platform and negate the benefits of vendor-independence and application portability.
So, are we advocating ignoring open standards? No, to the contrary, we strongly believe in the value of open standards
and their future wide acceptance. VoiceXML will steadily gain popularity, especially once CCXML addresses the current
shortcomings in call control, once the platform prices are reduced and once better developer tools emerge. Similarly,
SALT offers a great future promise because of its tight integration with the Microsoft environment (including rich develop-
ment tools). Until this happens, however, a proper RAD tool can get the job done much quicker and cheaper.
Our recommendation: don't be afraid of using proprietary environments if they are a better fit for your application and
especially when you can realize significant savings. However, make sure that your tools have a well-defined migration
path, should the open-standard market develop for your application in the future. In other words, ensure that your tools
either support VoiceXML or are properly integrated with VoiceXML products. You could even consider a hybrid solution
to combine the best of both worlds. For example, a front-end node built on a proprietary system could do the heavy-duty
call processing and call a VoiceXML gateway to execute best-of-breed third-party VoiceXML components and
applications.
One possible cost-saving strategy is again a hybrid solution, where speech recognition is applied selectively to the
areas that bring the most benefit to the application. For example, an auto-attendant and voice mail system may be
speech-enabled on the customer-facing side and touchtone-driven for the employees.
Make sure that the tools and middleware that you buy are as good for traditional IVR as they are for sophisticated
speech applications. Touchtone technology is not going away any time soon.
5 Conclusion
Speech today is ready for prime time. Thanks to reliable, accurate and commercially available speech engines, many
compelling applications have become possible, and many have been implemented already. At the same time, we continue to
see customers walking away from great speech applications and settling for the old-style touchtone solutions. The pri-
mary culprit is cost - in our view the most important factor barring speech from wider acceptance. Unfortunately, the cost
will stay high as long as speech remains limited to a niche market. Our industry has yet to come up with a creative way
to get out of this impasse.
Nevertheless, we believe that speech applications present opportunities for cost savings, even with today's high-priced
licenses and platforms. This paper has presented a number of practical guidelines for lowering the cost of speech by
properly selecting tools and technologies. We hope that applying these guidelines will help you to build a better business
case for speech on your next project.
6 About Pronexus
With nearly a decade of experience and more than 3000 clients and partners around the world, Pronexus Inc. has estab-
lished itself as a leader in Computer Telephony and Speech applications for wired and wireless environments. The com-
pany is the developer of the award-winning VBVoice™, a Rapid Application Development tool for building business-criti-
cal CT and speech solutions. It also provides professional services for businesses requiring custom applications and
develops OnCall, a line of turnkey business solutions for a variety of industries and applications. Comprehensive support
services and acclaimed training complete the firm's offerings.
7 Glossary