
System Architectures for Speech

A Practical Guide to Lowering Costs

Andrew Kozminski

A POWERFUL VOICE IN COMMUNICATION SOLUTIONS

260 Terence Matthews Cr.
Kanata, Ontario, Canada, K2M 2C7
Ph: 877-766-3987 Ext. 549
andrew.kozminski@pronexus.com

First presented at
AVIOS SPEECH DEVELOPERS
CONFERENCE & EXPO
March 31st - April 3rd, 2003

February 2003

Disclaimer: This paper presents personal views and opinions of the author at this time, which are not binding on Pronexus and are subject to change.
Table of Contents

1 EXECUTIVE SUMMARY
2 INTRODUCTION
3 WHY ISN'T SPEECH UBIQUITOUS?
3.1 COST OF COMPLEXITY
3.2 COST OF SOPHISTICATED BUT IMPERFECT TECHNOLOGY
3.3 COST OF CORE SPEECH TECHNOLOGY (LICENSES)
3.4 COST OF TUNING
4 ARCHITECTING FOR LOWER COST
4.1 SIMPLIFY THE COMPLEX
4.2 BREAK UP INTO MODULES
4.3 MAINTAIN VENDOR INDEPENDENCE
4.4 CONSERVE YOUR RESOURCES
4.5 DON'T SKIMP ON TUNING!
4.6 SELECT PLATFORM TO FIT YOUR APPLICATION
4.7 FALLBACK TO TOUCHTONE
5 CONCLUSION
6 ABOUT PRONEXUS
7 GLOSSARY

1 Executive Summary

Despite the great progress in speech technology over the past twenty years, speech applications are still not widely
used by carriers and enterprises. Although there are various reasons for the limited deployment of speech applications,
we believe that cost is the main barrier to wider adoption. The great cost of developing and deploying speech applica-
tions stems from several factors, including the complexity, technology inaccuracy and licensing costs, as well as post-
installation efforts.

Building a voice-enabled application is a complex task because it involves multiple core technologies from multiple vendors. Design complexity is further increased by the lower accuracy of speech recognition (compared to DTMF) and by the lack of restrictions on user input, both of which require additional programming logic dedicated to compensating for errors and failures.

In addition to design complexities and technology inaccuracies, which require time and human resources, speech appli-
cations also necessitate monetary investments in expensive ASR and TTS licenses. Due to their complicated nature,
speech applications require attention and tuning even after they are deployed (adding subscribers to grammars or deleting them, for example), increasing the cost of maintenance.

With the above in mind, is it possible to design a lower cost speech application? There are several guidelines that can
help developers to lower system costs and design for the best cost-functionality balance.

Since building a modern speech application requires integration of components provided by different vendors, the most
efficient way to create an application is to use a high-level Rapid Application Development (RAD) tool. RAD tools hide
low-level complexity and provide a uniform development environment for multi-vendor components. In addition, the use
of a development environment will help developers maintain vendor independence when it comes to speech resources
and telephony hardware.

Development and deployment costs can be reduced by using a modular architecture. Among other things, a modular architecture allows for independent provisioning and software hot-swaps, eliminating costly downtime and providing increased reliability, while enabling resource sharing. Using royalty-free engines, properly engineering the license manager, and using floating licenses can lower the cost of ASR and TTS licenses.

These guidelines provide a framework for lowering the cost of speech-enabled applications. The remainder of this docu-
ment describes them in more detail.

2 Introduction
It's certainly no secret that the last few years have brought some dramatic changes to the high-tech landscape. The
harsh economic reality tested many business formulas, technologies and products. Unfortunately, many of the once
widely hyped ideas did not survive and all of us have learned an important lesson: at the end of the day, the only suc-
cessful products/solutions are those that make money. In other words, it's all about cost, price and, above all, Return on
Investment (ROI).

So, how can we build a compelling ROI case for speech technology? Of course the return depends on the business side
of your application/project. Because this is a technical paper, we'll only briefly touch on this aspect. But the investment is
mostly about technology: it's the cost of the building blocks that you use and the engineering and tuning hours that you
spend. We'll take a closer look at this aspect, because the technical decisions made early in a project can have a dramatic
impact on the final cost. No matter how much engineers hate this, more often than not, it is the cost that kills the great
product ideas.

With the above in mind, this paper will review different technologies, architectures and implementation strategies for building speech applications, with a special focus on costs versus functionality.

Throughout this paper we'll refer to the example of the Airport Assistant, a real-life speech-enabled application implemented by one of our partners. Airport Assistant allows callers to select by name one of several hundred airport services, request real-time flight information and be notified (at a cell number they provide) about any changes in the schedule. Airport Assistant has been successfully deployed at the biggest airport in Canada - Toronto's Pearson International.

3 Why isn't speech ubiquitous?


First, let's put things in perspective by asking another question: how much is speech used today? According to
Datamonitor, the total supply-side market for "voice business technologies and services" (meaning all systems employ-
ing TTS and ASR) is growing from $629 million in 2001 to $4.3 billion in 2007. That's a compound annual growth rate
(CAGR) of 38%. It looks impressive, but only until you realize a key point: these are total numbers that must be further
broken down into verticals. For example, if you are in the healthcare and pharmaceuticals business, your share is only
2% (again, a Datamonitor number), which for 2001 translated to $12.5 million worldwide. Obviously, this represents a
low penetration by any stretch of the imagination.
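
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the function name `cagr` is our own, and the inputs are the Datamonitor figures quoted above. It gives a growth rate of roughly 38% and a 2001 healthcare share of about $12.6 million, in line with the quoted $12.5 million up to rounding.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.38 means 38%)."""
    return (end_value / start_value) ** (1 / years) - 1


market_2001 = 629e6  # USD, Datamonitor figure quoted in the text
market_2007 = 4.3e9  # USD, Datamonitor forecast quoted in the text

growth = cagr(market_2001, market_2007, years=2007 - 2001)

# Healthcare and pharmaceuticals: 2% of the 2001 total.
healthcare_2001 = 0.02 * market_2001

print(f"CAGR 2001-2007: {growth:.1%}")
print(f"Healthcare share in 2001: ${healthcare_2001 / 1e6:.1f} million")
```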

So why, despite almost twenty years of continuous improvements in technology, are speech applications still so slow to
sell? Most analysts agree that the technology itself is "ready for prime time" and that customers are ready to accept its
clear benefits. Where is the problem? Many different reasons have been cited, from numerous false starts of immature
technology, to bad design, to cultural aversion to "talking to a machine". They are all true, but in our opinion, a
major remaining barrier to wider adoption is cost. The remainder of this section discusses the main factors contributing
to the high cost of speech applications today.

3.1 Cost of Complexity


Despite many years of great technological progress, building voice applications is still a complex task. It typically
involves integration of multiple core technologies: ASR, TTS, speaker verification, speech object libraries, telephony
hardware, call processing, web services and databases, all tied together with thousands of lines of custom code.
Although all vendors strive to position themselves as one-stop shops, none is a leader in all areas, which forces system architects to pick and choose best-of-breed components from multiple suppliers. A classic example is multilingual
TTS - no vendor has the best voice quality in all languages.

Of course, components coming from different vendors are not designed to work easily together, which makes implementation difficult and prolongs the learning curve. It also requires a team with a diverse and sophisticated skill set: telephony hardware, protocols, real-time programming and Web development, to name a few. Even today, experienced developers and voice system designers are not cheap and their salaries quickly add to the cost of a project.

3.2 Cost of Sophisticated but Imperfect Technology


When compared to touchtone, there is no doubt that speech technology allows for a much more effective and satisfying
user experience. However, those benefits come at a price: building a good speech application requires a lot more effort
invested in the human-machine interface (HMI). Touchtone based user interfaces were easy to design: the user input
was always limited to one out of ten digits and the DTMF detectors guaranteed almost 100% accuracy. With speech and
natural language, the situation is vastly different. Not only is recognition never 100% accurate but also the user input is
not restricted in any way. This puts a lot of extra burden on the application design and dramatically increases its com-
plexity - it is not uncommon that more programming logic is dedicated to compensating for errors and failures (low accu-
racy, ambiguity, out-of-grammar vocabulary, and so on) than to the actual business features. As a result, designing a
good speech interface is a difficult art requiring the interdisciplinary talents of computer science, linguistics and human
factors psychology.
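
To give a feel for the kind of error-compensation logic described above, here is a minimal sketch of the decision a dialog has to make after every recognition attempt. Everything in it is an assumption made for illustration: the `RecognitionResult` type, the threshold values and the action names are hypothetical and do not come from any particular engine. In a real system the thresholds would be a product of tuning.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values; real thresholds are found during tuning.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.50
MAX_ATTEMPTS = 3


@dataclass
class RecognitionResult:
    text: Optional[str]  # None means no match, e.g. out-of-grammar input
    confidence: float    # 0.0 .. 1.0


def next_action(result: RecognitionResult, attempt: int) -> str:
    """Decide what the dialog should do after one recognition attempt."""
    if result.text is not None and result.confidence >= ACCEPT_THRESHOLD:
        return "accept"
    if result.text is not None and result.confidence >= CONFIRM_THRESHOLD:
        return "confirm"   # e.g. "Did you say ...?"
    if attempt < MAX_ATTEMPTS:
        return "reprompt"  # ask again, ideally with a more helpful prompt
    return "fallback"      # e.g. a touchtone menu or a live operator


print(next_action(RecognitionResult("flight information", 0.92), attempt=1))
print(next_action(RecognitionResult("flight information", 0.60), attempt=1))
print(next_action(RecognitionResult(None, 0.0), attempt=3))
```

Even this toy version has four outcomes for a single question. A touchtone menu needs only one branch per key.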

3.3 Cost of Core Speech Technology (Licenses)


Today, the core speech technology is offered by a handful of vendors who have spent heavily on development of their
products and are now trying (rightly so) to recoup and capitalize on their investment. Consequently, TTS and ASR
licenses continue to be very expensive.

As an example, the ASR licenses (no TTS was used) for the Toronto Airport Assistant application, described in the intro-
duction, originally accounted for almost 50% of the total system cost. The remaining 50% paid for everything else,
including redundant hardware, development tools, database server, UPS and more. The cost of ASR licenses was later
reduced to less than 23%, by better license management that took advantage of the specific call patterns. This approach
is discussed in more detail in the following sections.

With such high licensing costs, it is unfortunate that many commercial speech platforms seem to completely ignore the
issue and continue to use licenses very ineffectively. It is not uncommon that one application port could require two or
more ASR licenses, especially if multiple languages or "always active hot-words" are involved. Obviously, doubling or
tripling the licensing cost has a dramatic impact on the end user price of the finished application.

3.4 Cost of Tuning


The high cost of speech applications doesn't end with the first installation. Even the best-designed systems require a lot
of tuning before they can be put into full production. Furthermore, some applications (such as dial-by-name auto-attendants, which need subscribers constantly added to or deleted from their grammars) require ongoing tuning throughout their life cycle. Tuning typically requires the attention of a computational linguist, who is still a rare and therefore hard-to-find professional. This is in sharp contrast to traditional touchtone systems, which, once tested by a software
QA team, would run virtually maintenance-free in production.

4 Architecting for Lower Cost

We've identified cost as one of the main barriers to the wider adoption of speech applications, in particular in mid-market
environments. High up-front costs result in marginal ROI stories, and it doesn't matter how elegant or efficient the appli-
cation is if no one buys it. Thus, it is important to design for the best cost-functionality balance.

Today, system architects and developers of speech-based telephony applications face many difficult choices regarding
platforms, tools, speech technologies, and so on. The abundance of new standards only adds to the overall confusion.
Of course, no single system architecture could meet all possible requirements, but at the same time, selecting the right
architecture has a fundamental impact, especially on cost. This section offers a number of specific recommendations
based on our real-life experiences.

4.1 Simplify the Complex


Modern speech applications are very rarely (if ever) built from scratch. Instead, multiple ready-to-go building blocks are
put together, including ASR and TTS engines, object libraries, call processing frameworks and development tools. As a
result, building a modern speech application has become a matter of integration. This has certainly simplified the task,
but it hasn't made it easy - integration comes with problems of its own.

Typically, speech building blocks (cards or speech engines) come with native APIs, that is, low-level interfaces requiring
programming in C++. But writing a speech application in C++ is not for the faint-of-heart. It may sound like a fun chal-
lenge to developers, but certainly not to a project manager who is responsible for budgeting and scheduling. Low-level
details (such as call state machines, resource management, multi-threading, voice buffering, ActiveX, COM, and sock-
ets) will very quickly defocus developers from solving the actual business problem at hand. Learning APIs and telephony
abstraction models from different vendors will dramatically extend the learning curve. Don't count on the telephony standards for help: unfortunately, despite multiple attempts (such as TAPI, SAPI, TSAPI, and S.100), there is no universal
standard API for low-level telephony functionality. Adoption by vendors is random at best, and interoperability of particu-
lar implementations can never be taken for granted.

Can this complexity of native APIs be avoided? Absolutely - by leveraging the work done by others. In practice, this
means using one of the high-level Rapid Application Development (RAD) tools. RAD tools hide low-level complexity and
abstract the mishmash of multi-vendor components into a uniform development environment, focused on the business
logic, not technology. Some RAD tools reduce the learning curve even more by leveraging the power of one of the
industry-standard Integrated Development Environments (IDEs), for example Microsoft Visual Studio, and popular development languages, such as Visual Basic or Java.

Properly selected development tools can save a lot of money. On the Toronto Airport Assistant project, both the develop-
ers and project manager claimed that the switch to an appropriate tool cut the delivery time by more than 50%.

Of course, not all RAD tools are created equal, and they should be carefully selected for a specific project and its
development team. Ideally, you should look for a tool that combines a high-level visual design environment with a flexible
programming environment.

Visual Design - Most RAD tools use drag-and-drop graphical interfaces, which increase programmer productivity and
enhance structure and readability of the source code. However, when selecting a tool, consider inherent limitations of
visual design and programming. Ready-to-use building blocks work very well as long as all functionality is available.
Some applications may benefit from the simplicity of this approach; however, most practical applications require function-
ality that goes beyond ready-to-go functions. Sooner or later, you'll need to customize blocks or integrate them with
external third-party components. In other words, choose tools that combine the productivity gains of visual programming
with a powerful programming environment.

Programming Environment - Building speech applications today is all about integration and customization. Any serious application requires some custom programming. The right development and debugging tools can save your project when you least expect it. Make absolutely sure that your tools support a serious, industry-standard programming language, source-level debugging, seamless invocation of component libraries (such as DCOM, ActiveX, and CORBA) and control over multithreading. If you're building an application on Windows, don't miss out on the benefits of the next-generation technology from Microsoft - your tool must support .NET!

As a typical example, one of our customers built a large outbound dialing application that depended on answering
machine detection. The initial implementation used the original detection algorithm embedded into Dialogic cards.
Unfortunately, statistical accuracy (which depends highly on the target calling area) couldn't be verified before field trials.
When the first results came in, accuracy levels were around 80%, significantly below expectations. Because their RAD
tool was fully integrated into Microsoft Visual Studio, the programmers were able to devise a custom solution that boosted accuracy to 96%. This would have been impossible without a flexible programming environment.

4.2 Break Up into Modules


The ability to break your speech application into cooperating modules is a must. Not only does it improve scalability, reli-
ability and performance of your system, but it also saves you money in both development and production.

A modular system is cheaper to build and maintain. In development, programmers benefit from working in parallel on
well-defined modules. In production, independent module provisioning and software hot-swaps eliminate costly system
downtimes. At the same time, separating application logic from telephony and speech processing allows resource shar-
ing, which in turn leads to more efficient utilization. Finally, distributing your modules across a LAN enables load balanc-
ing and effortless scalability - again resulting in savings on system maintenance.

The biggest benefit, however, comes from increased reliability of a modular system. Nothing is more frustrating to callers
than a system that crashes into "dead silence" in the middle of a transaction. An unreliable system will soon be pulled
out of production, which always means significant financial losses.

A monolithic executable is only as reliable as its weakest component, while a modular system can stay operational even
after losing one of its modules. Therefore, it is very important that application modules execute properly separated from
each other and from the system processes, so that a fatal error in one doesn't bring down the whole system. The mod-
ules should run out-of-process, or even better, distributed across a LAN. Ideally, modules should be compiled directly
into stand-alone executables, not into intermediate scripts or p-code. Not only does this speed up program execution, it
also removes the dependency on a shared runtime engine as a single point of failure.
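
The benefit of running modules out-of-process can be illustrated with a small supervisor. The sketch below is an illustration under our own assumptions, not a description of any particular product: the function `supervise` and the two stand-in "modules" (tiny Python child processes) are hypothetical. The point is that a module which exits abnormally is restarted while the supervising process keeps running.

```python
import subprocess
import sys
from typing import List, Sequence


def supervise(command: Sequence[str], max_restarts: int = 2) -> List[int]:
    """Run one module as a separate OS process; restart it if it exits abnormally.

    Returns the exit code of every run so the caller can log or raise an alarm.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        completed = subprocess.run(list(command))
        exit_codes.append(completed.returncode)
        if completed.returncode == 0:
            break  # clean shutdown, nothing to restart
    return exit_codes


# Hypothetical stand-ins for real modules: one exits cleanly, one "crashes".
healthy_module = [sys.executable, "-c", "pass"]
crashing_module = [sys.executable, "-c", "import sys; sys.exit(3)"]

healthy_runs = supervise(healthy_module)
crashing_runs = supervise(crashing_module)

# The supervisor is still alive after the crashing module failed repeatedly.
print("healthy module exit codes:", healthy_runs)
print("crashing module exit codes:", crashing_runs)
```

Had the same fault occurred inside a single monolithic executable, there would be no surviving process left to notice it or to restart anything.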

4.3 Maintain Vendor Independence


From a cost perspective, vendor-independence comes into play with respect to speech resources and telephony hard-
ware.

Speech Resources - Selecting the right speech components for your application is very important. Speech technology is
still complex and very expensive, but the quality and accuracy of the chosen engines could ultimately make a difference
between the success and failure of your project.

In practice, when it comes to speech processing, the "one size fits all" approach does not work. This is true for ASR, but
even more so for TTS - not all engines and languages are equal. Therefore, it pays to carefully research and evaluate
different vendors before deciding on a TTS product. Remember that perception of quality is subjective and may depend
on your audience and application. As a result you may end up working with more than one vendor at a time and chang-
ing vendors as your application evolves. Unfortunately, this may require re-doing the integration work many times, result-
ing in additional cost. If you're lucky, the engine of your choice may support a standard API (like SAPI), but not all do.
And even if it does support SAPI, these interfaces are often far from perfect. Fine variations in timing, buffering schemes
and performance can result in irritating gaps, clicks and delays. Usually, better results are achieved through individually
crafted, native APIs, but this again means additional development costs.

The best way to achieve vendor-independence is to use a middleware abstraction layer, which in turn works with a number of alternative engines. Again, a RAD tool is appropriate: it will protect your investment in application development
should a shift in requirements, technology or vendor strategies necessitate the move to another engine. It will also allow
you to experiment with multiple speech products to find the best price-quality balance for your application.
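
The idea of an abstraction layer can be sketched in a few lines. All names below (`TtsEngine`, `VendorATts`, `VendorBTts`, `speak`) are hypothetical, and the two adapters are stubs that return tagged bytes instead of audio; a real adapter would wrap the vendor's native API. The application only ever calls the neutral interface, so swapping an engine, or choosing a different engine per language, is a configuration change rather than a rewrite.

```python
from abc import ABC, abstractmethod


class TtsEngine(ABC):
    """Vendor-neutral interface the application is written against."""

    @abstractmethod
    def synthesize(self, text: str, language: str) -> bytes:
        """Return audio for the given text."""


class VendorATts(TtsEngine):
    def synthesize(self, text: str, language: str) -> bytes:
        # Stub: a real adapter would call vendor A's native API here.
        return f"[A:{language}] {text}".encode("utf-8")


class VendorBTts(TtsEngine):
    def synthesize(self, text: str, language: str) -> bytes:
        # Stub: a real adapter would call vendor B's native API here.
        return f"[B:{language}] {text}".encode("utf-8")


# Configuration: pick the engine that sounds best for each language.
ENGINES = {"en": VendorATts(), "fr": VendorBTts()}


def speak(text: str, language: str) -> bytes:
    """Application-level call; knows nothing about individual vendors."""
    return ENGINES[language].synthesize(text, language)


print(speak("Your flight is on time.", "en"))
print(speak("Votre vol est a l'heure.", "fr"))

# Switching the French voice to vendor A touches configuration only.
ENGINES["fr"] = VendorATts()
print(speak("Votre vol est a l'heure.", "fr"))
```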

Telephony Hardware - Similar to speech resources, vendor-independence of telephony hardware can save you money. It
is not uncommon for speech applications to run on more than one brand of hardware or to switch vendors for better pric-
ing. In general, the smaller telephony hardware suppliers tend to be less expensive and much more accommodating
when it comes to technical support plans, and if you're new to computer telephony you will definitely require support. On
the other hand, smaller vendors may not support all cards and protocols. The most popular ones (analog, T1/E1, ISDN-PRI, H.323, R2-CAS, etc.) are a must. However, it's also worth paying attention to the less obvious capabilities, such as
SIP, PBX set-emulator cards and transfer-on-CO protocols like TBCT, RLT and Q.SIG. Again, make sure that your mid-
dleware framework allows experimenting with different cards and protocols from multiple vendors, and keep in mind that
your requirements may change with time.

PBX Integration - Most analysts predict that call centers will account for the biggest slice of the "speech market pie" in
the coming years. If your organization is engaged in the call-center market, you know that developing applications for
just one PBX-brand is not enough. Given that speech is so much more expensive than touchtone, PBX vendor-inde-
pendence is becoming more important than ever. Again, make sure that your development tools or middleware offer a
good PBX integration story.

What does this mean in practical terms? Today, many PBX vendors are changing their traditional proprietary architec-
tures and have begun opening up to third-party applications. Since PBXs are built for the enterprise (with its strong
Microsoft presence), this trend is most visible in Windows, where in recent years we've observed a renewed interest in
TAPI as the integration technology of choice. As a result, you may expect a complete and well-tested TAPI Service
Provider for almost any switch, especially for modern IP-PBXs. In our opinion, building your speech application on TAPI
is the best strategy for widening your customer base and consequently maximizing your ROI in the call-center market.

4.4 Conserve your Resources


As noted earlier, ASR and TTS licenses are the most expensive, yet also the most misused resources in speech applica-
tions. Some commercially available platforms use two or more ASR licenses per application port, particularly in multilin-
gual or hot-word applications. Below we present a few practical guidelines for saving money:

Royalty Free Engines - Yes, they are available! Companies like Microsoft and Aculab offer license-free ASR and TTS
technologies of high quality that may be perfect for your application. One word of caution: customers may accept a lower
quality TTS (as long as the message is understandable), but they have much less tolerance for imperfect ASR. From our
experience, speech recognition has to work close to perfectly, or it will be deemed useless and dropped. In other words,
carefully evaluate your ASR alternatives.

As with the other elements of your application, maintaining vendor-independence works to your advantage, allowing you,
for example, to experiment with free engines before deciding to spend your dollars on licenses. (As a side note, none of
the speech vendors that we know of accepts returns of purchased licenses). Again, picking a middleware framework that
supports both free and commercial engines is, in our opinion, the best strategy.

One license per channel - If you decide to use a commercial speech engine from one of the industry leaders, invest
some time to properly engineer your license manager - a lot of money can be saved by this effort. There is no technical
reason to use more than one engine license (TTS or ASR) per application channel. Even systems using multiple lan-
guages or parallel grammars to implement hot-words can be designed to use one license per channel at any given time.
Make sure that your middleware doesn't force you to unnecessarily double your resources.

Floating licenses - You should also keep in mind that many applications don't require ASR and TTS for the whole dura-
tion of a call. As an example, consider a pre-paid calling-card system. It uses speech recognition to identify callers,
checks account balances and then bridges calls to outbound trunks. In this scenario, speech resources (ASR and TTS)
are only required for a small fraction of a call, possibly as low as 10%. Once a call is bridged, the resources can be
redeployed to serve other channels - this presents a great savings opportunity. If licenses could float dynamically
between channels, in theory the savings could be as high as 90%. Unfortunately, not many platforms allow floating
speech resources, but some systems do. Given the savings, it pays to ensure that the tool you choose allows for proper
license management to take advantage of the specific calling patterns in your application.
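
A floating-license scheme can be sketched as a small shared pool. The class `LicensePool` and its methods are hypothetical names chosen for this illustration, not the interface of any real license manager. A channel takes a license only while it needs speech, holds at most one, and returns it as soon as the call is bridged. The closing arithmetic uses the 10% figure from the calling-card example and an assumed 100 channels; it gives the average demand only, and a real deployment would need headroom for peaks.

```python
import threading


class LicensePool:
    """A fixed number of engine licenses shared by all channels."""

    def __init__(self, size: int) -> None:
        self._size = size
        self._holders = set()
        self._lock = threading.Lock()  # channels typically run on separate threads

    def acquire(self, channel_id: int) -> bool:
        """Give the channel a license if one is free; never more than one per channel."""
        with self._lock:
            if channel_id in self._holders:
                return True
            if len(self._holders) >= self._size:
                return False
            self._holders.add(channel_id)
            return True

    def release(self, channel_id: int) -> None:
        """Return the license, for example once the call has been bridged."""
        with self._lock:
            self._holders.discard(channel_id)

    @property
    def in_use(self) -> int:
        with self._lock:
            return len(self._holders)


pool = LicensePool(size=2)
first = pool.acquire(1)   # channel 1 starts recognizing
second = pool.acquire(2)  # channel 2 starts recognizing
third = pool.acquire(3)   # refused: both licenses are busy
pool.release(1)           # channel 1 has been bridged to an outbound trunk
retry = pool.acquire(3)   # channel 3 now gets the freed license
print(first, second, third, retry, pool.in_use)

# Average demand under the calling-card assumption (illustrative numbers).
channels = 100
speech_fraction = 0.10
average_licenses_busy = channels * speech_fraction
print(f"average licenses busy: {average_licenses_busy:.0f} of {channels} channels")
```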

4.5 Don't Skimp on Tuning!


Any non-trivial speech application requires extensive testing and tuning, much more than a traditional touchtone system.
This aspect is new, and often comes as a surprise to designers coming from an old IVR background.

Tuning is much more than just tweaking grammars. It is an iterative process of analyzing system performance and
repeatedly applying the best design practices in order to arrive at the most satisfying user experience and in order to
work around technology imperfections. As a result, the tuning phase can take many months and requires an interdiscipli-
nary team of professionals, including not only developers and testers but also experts in linguistics and often psychology.
The resulting cost is substantial, but in our experience this is money well spent.

Planning for tuning is difficult, because speech systems, unlike touchtone, are highly dependent on the demographics, local accents, language mix and even the culture of the target audience. Some applications, such as speech-enabled auto-attendants, may require regular, on-going tuning as the grammar (i.e. a list of employee names) changes over time.
Typically, end users are not able and should not be expected to perform tuning themselves and the application should
be specifically designed for on-going, remote maintenance by the vendor.
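
One way to make such on-going maintenance less painful is to regenerate the name grammar from the employee list instead of editing it by hand. The sketch below assumes, purely for illustration, that the engine accepts grammars in the Java Speech Grammar Format (JSGF); your engine may well use a different format. The function name and the employee names are hypothetical.

```python
from typing import Iterable


def build_name_grammar(names: Iterable[str]) -> str:
    """Render a list of names as one JSGF rule of alternatives."""
    unique_names = sorted(set(names))
    if not unique_names:
        raise ValueError("a name grammar needs at least one name")
    alternatives = " | ".join(unique_names)
    return (
        "#JSGF V1.0;\n"
        "grammar employees;\n"
        f"public <employee> = {alternatives};\n"
    )


employees = {"Alice Smith", "Bob Jones"}

# Staff changes: one person joins, one leaves; then the grammar is rebuilt.
employees.add("Carol White")
employees.discard("Bob Jones")

grammar_text = build_name_grammar(employees)
print(grammar_text)
```

Regenerating the file is the easy part. Checking that the new names are actually recognized (pronunciations, confusable names) is the tuning work the text refers to.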

Tuning large grammars, such as a city's phone book, tends to be particularly challenging and should be approached with
special caution. The experienced speech technology providers seem to be well aware of the possible problems: one
highly recognized vendor would sell us a ready-to-use grammar containing tens of thousands of names, but would not
venture into signing a contract to get it working in the field.

Unfortunately, saving on tuning is not easy and may jeopardize the final quality of your product. Some savings may be
achieved by employing off-the-shelf speech component libraries such as Nuance Speech Objects or SpeechWorks
Dialog Modules. But in general, tuning is not the area in which to be penny-pinching. In speech recognition applications,
users typically have very little patience for shaky technology. The system has to be almost perfect or it will not be used.
There is no middle ground.

4.6 Select Platform to Fit Your Application


Over the last few years, we've observed two promising trends impacting telephony and speech applications: open
source operating systems (mainly Linux) and XML-based scripting languages (mainly VoiceXML and SALT). But before
you bet your budget, take a careful look at the cost - the bottom-line ROI of your application is the criterion of success.

Operating System - The choice of operating system fundamentally impacts many aspects of a speech application.
Today, there are three main choices: Unix (mainly Solaris), Linux or Windows. While discussing the merits of each OS is
beyond the scope of this paper, we will discuss some important considerations specific to telephony and speech.

First, keep your target market in mind. The old bias against Windows still holds strong in some traditional telephony envi-
ronments, especially among carriers in North America. Even recently, we've seen an already completed application
being ported to Solaris after approaching carriers with a Windows version. However, other regions of the world regard
Windows much more favorably. Even in North America, the situation is much different in the enterprise, where Windows
naturally fits into the desktop and business back-end dominated by Microsoft.

The most widely quoted complaints against Windows are reliability and price. We believe that this continued bias is no
longer justified - the modern Windows is reliable enough for speech applications, both for carriers and enterprise. As for
the price, Windows compares favorably to Unix, and while Linux is free, the price of the OS alone is often almost negligi-
ble, especially for systems deployed in small numbers. For example, in the case of the Toronto Airport Assistant (48
lines, Nuance), the price of the Windows operating system was not even 1% of the total cost.

Therefore, the price of the operating system is secondary to the availability of strong development tools, component
libraries and middleware. The resulting increase in developer productivity has the potential to far outweigh the savings
on the purchase price of the operating system.

Open Standards for Speech - Recent years have brought multiple exciting initiatives to standardize the development of
speech applications, including the well-known VoiceXML, SALT, X+V, and CCXML. A discussion of their respective
technical merits is beyond the scope of this paper, but there is a wealth of relevant information available from many sources.
Similarly, we will not attempt to speculate which standard will ultimately prevail in the future. Instead, we will point out a
few less obvious aspects that may impact your immediate strategy today. We will focus on VoiceXML, as it is the only
standard in deployment today. The fundamental question is: will your project benefit from using the standard or would
you be better off with a proprietary system? Unfortunately, the answer is not always straightforward. Below we present a
few ideas to consider.

First, try to articulate the exact benefits of VoiceXML for your particular application. Next, look at the cost of your
respective choices. Request quotes for equivalent VoiceXML and proprietary platforms and analyze them carefully. You
will most likely find VoiceXML environments to be substantially more expensive. We received a quote of $1,265 per port
for a typical 24-port VoiceXML gateway (including hardware, but without ASR or TTS licenses). You can build an
equivalent system using one of the popular proprietary RAD tools at half this cost.
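
To make the comparison concrete, the quoted figures can be worked through in a few lines. This is only an illustrative sketch: the port count and per-port price come from the quote above, the "half this cost" ratio is our own estimate, and the function name platform_cost is hypothetical.

```python
def platform_cost(ports: int, price_per_port: float) -> float:
    """Total platform cost, excluding ASR and TTS licenses."""
    return ports * price_per_port


PORTS = 24
VOICEXML_PER_PORT = 1265.0                      # quoted VoiceXML gateway price per port
PROPRIETARY_PER_PORT = VOICEXML_PER_PORT / 2    # "half this cost" estimate from the text

voicexml_total = platform_cost(PORTS, VOICEXML_PER_PORT)
proprietary_total = platform_cost(PORTS, PROPRIETARY_PER_PORT)
savings = voicexml_total - proprietary_total

print(f"VoiceXML gateway: ${voicexml_total:,.2f}")
print(f"Proprietary RAD:  ${proprietary_total:,.2f}")
print(f"Difference:       ${savings:,.2f}")
```

Substituting the figures from your own quotes shows how the platform cost difference scales with the number of ports.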

Second, consider again the productivity of your programmers. VoiceXML, a complex language in itself, is not enough to
build an application. You still need scripts (CGI, Java, ASP, JavaScript or VBScript) to implement your business logic, and
then you need to host it all on a web server. The resulting environment is complex and does not offer a ready-to-use
framework specific to telephony. Consequently, you'll need to think about multithreading your telephony channels,
dealing with call state machines and synchronizing access to global data structures, to name just a few.

Debugging your application is another important consideration. We don't know of any VoiceXML Gateway that offers an
IDE supporting the complete environment. Even for VoiceXML alone, the development tools are in short supply.
Furthermore, most tools are merely GUI overlays on top of VoiceXML syntax -- good only for creating static pages (as
opposed to dynamic pages, which are generated on-the-fly from database queries and program logic). Therefore, before
committing to a VoiceXML gateway, ask about source-level debugging, handling of call state machines, multithreading of
application code, accessing databases and other basic programming tasks. Our point is that today's VoiceXML
development environment is still primitive when compared to industry-standard IDEs, like Microsoft Visual Studio.

Finally, take a look at the planned functionality of your application. VoiceXML, by definition, is limited by its own
specification. VoiceXML has been designed specifically for speech-based user dialogs and that's where it excels. If your
application is about call control, then beware: your only hope will be proprietary extensions, which in turn tie you to a
specific platform and negate the benefits of vendor independence and application portability.

So, are we advocating ignoring open standards? No. On the contrary, we strongly believe in the value of open standards
and their future wide acceptance. VoiceXML will steadily gain popularity, especially once CCXML addresses the current
shortcomings in call control, once the platform prices are reduced and once better developer tools emerge. Similarly,
SALT offers great future promise because of its tight integration with the Microsoft environment (including rich
development tools). Until this happens, however, a proper RAD tool can get the job done much quicker and cheaper.

Our recommendation: don't be afraid of using proprietary environments if they are a better fit for your application and
especially when you can realize significant savings. However, make sure that your tools have a well-defined migration
path, should the open-standard market develop for your application in the future. In other words, ensure that your tools
either support VoiceXML or are properly integrated with VoiceXML products. You could even consider a hybrid solution
to combine the best of both worlds. For example, a front-end node built on a proprietary system could handle the
heavy-duty call processing and call a VoiceXML gateway to execute best-of-breed third-party VoiceXML components and
applications.
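
The division of labor in such a hybrid can be sketched as follows. This is a hypothetical illustration only: the step names, CALL_CONTROL_STEPS, run_call and fake_gateway are invented for this example, and the gateway is a stub rather than a call to any real VoiceXML product.

```python
from typing import Callable, List

# Steps the proprietary front-end node handles itself (hypothetical names).
CALL_CONTROL_STEPS = {"answer", "transfer", "conference", "hangup"}


def run_call(steps: List[str], run_dialog: Callable[[str], str]) -> List[str]:
    """Handle call-control steps locally and delegate dialogs to the gateway."""
    log = []
    for step in steps:
        if step in CALL_CONTROL_STEPS:
            log.append(f"front-end: {step}")
        else:
            # Anything that is not call control is treated as a speech dialog.
            log.append(f"gateway: {run_dialog(step)}")
    return log


def fake_gateway(dialog_name: str) -> str:
    """Stand-in for a VoiceXML gateway that executes a named dialog."""
    return f"ran dialog '{dialog_name}'"


for line in run_call(["answer", "main_menu", "transfer", "hangup"], fake_gateway):
    print(line)
```

Keeping the gateway behind a single callable means the dialog side can be swapped for a different VoiceXML product without touching the call-control code.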

4.7 Fallback to Touchtone


Yes indeed, this is the last resort. For the record, we truly believe in the superiority of speech recognition. But don't
discard the old touchtone just yet. At the end of the day, reverting to touchtone may cut the cost enough to get your
budget approved. Whether we like it or not, many of our customers today opt to save through touchtone, and the fact remains
that some applications won't benefit significantly from speech recognition.

One possible cost-saving strategy is again a hybrid solution, where speech recognition is applied selectively to the
areas that bring the most benefit to the application. For example, an auto-attendant and voice mail system may be
speech-enabled on the customer-facing side and touchtone-only for employees.
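
As a rough sketch of how such a policy affects licensing, the snippet below counts how many lines would need a speech recognition license under a selective scheme. The caller categories, the policy table and the 8/16 line split are assumptions chosen purely for illustration, not figures from any deployment.

```python
from enum import Enum


class InputMode(Enum):
    SPEECH = "speech"
    TOUCHTONE = "touchtone"


# Hypothetical policy: speech for customers, touchtone for employees.
MODE_BY_CALLER = {
    "customer": InputMode.SPEECH,
    "employee": InputMode.TOUCHTONE,
}


def input_mode(caller_type: str) -> InputMode:
    # Unknown caller types fall back to touchtone, which needs no ASR license.
    return MODE_BY_CALLER.get(caller_type, InputMode.TOUCHTONE)


def asr_ports_needed(lines) -> int:
    """Count the lines that require a speech recognition license."""
    return sum(1 for caller in lines if input_mode(caller) is InputMode.SPEECH)


# Assumed split for illustration: 8 customer-facing lines, 16 employee lines.
lines = ["customer"] * 8 + ["employee"] * 16
print(f"ASR licenses needed: {asr_ports_needed(lines)} of {len(lines)} lines")
```

Under these assumed numbers, only a third of the lines would carry ASR license costs, while every caller still reaches the same application.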

Make sure that the tools and middleware that you buy are as good for traditional IVR as they are for sophisticated
speech applications. Touchtone technology is not going away any time soon.

5 Conclusion

Speech today is ready for prime time. Thanks to reliable, accurate and commercially available speech engines, many
compelling applications have become possible, and many have been implemented already. At the same time, we continue to
see customers walking away from great speech applications and settling for the old-style touchtone solutions. The
primary culprit is cost, in our view the most important factor barring speech from wider acceptance. Unfortunately, the cost
will stay high as long as speech remains limited to a niche market. Our industry has yet to come up with a creative way
to get out of this impasse.

Nevertheless, we believe that speech applications present opportunities for cost savings, even with today's high-priced
licenses and platforms. This paper has presented a number of practical guidelines for lowering the cost of speech by
properly selecting tools and technologies. We hope that applying these guidelines will help you to build a better business
case for speech on your next project.

6 About Pronexus

With nearly a decade of experience and more than 3000 clients and partners around the world, Pronexus Inc. has
established itself as a leader in Computer Telephony and Speech applications for wired and wireless environments. The
company is the developer of the award-winning VBVoice™, a Rapid Application Development tool for building
business-critical CT and speech solutions. It also provides professional services for businesses requiring custom applications and
develops OnCall, a line of turnkey business solutions for a variety of industries and applications. Comprehensive
support services and acclaimed training complete the firm's offerings.

7 Glossary

API - Application Programming Interface


ASR - Automatic Speech Recognition
BRI - Basic Rate Interface
CAGR - Compound Annual Growth Rate
CO - Central Office
GUI - Graphical User Interface
HMI - Human-Machine Interface
IDE - Integrated Development Environment
ISDN - Integrated Services Digital Network
OS - Operating System
PBX - Private Branch Exchange
PRI - Primary Rate Interface
RAD - Rapid Application Development
RLT - Release Link Transfer
ROI - Return on Investment
TAPI - Telephony Application Programming Interface
TBCT - Two B-Channel Transfer
TTS - Text to Speech
VUI - Voice User Interface
