
PUBLISHED IN PROCEEDINGS OF THE IEEE, VOL. 88, NO. 8, 1166-1180, AUGUST 2000

Conversational Interfaces: Advances and Challenges


Victor W. Zue and James R. Glass

Abstract— The last decade has witnessed the emergence of a new breed of human computer interfaces that combines several human language technologies to enable humans to converse with computers using spoken dialogue for information access, creation, and processing. In this paper, we introduce the nature of these conversational interfaces, and describe the underlying human language technologies on which they are based. After summarizing some of the recent progress in this area around the world, we discuss development issues faced by researchers creating these kinds of systems, and present some of the ongoing and unmet research challenges in this field.

Keywords—Conversational interfaces, spoken dialogue systems, speech understanding systems.

I. Introduction

COMPUTERS are fast becoming a ubiquitous part of our lives, brought on by their rapid increase in performance and decrease in cost. With their increased availability comes a corresponding increase in our appetite for information. Today, for example, nearly half the population of North America are users of the World Wide Web, and the growth is continuing at an astronomical rate. Vast amounts of useful information are being made widely available, and people are utilizing it routinely for education, decision-making, finance, and entertainment. Increasingly, people are interested in being able to access the information when they are on the move – anytime, anywhere, and in their native language. A promising solution to this problem, especially for small, hand-held devices where conventional keyboards and mice can be impractical, is to impart human-like capabilities onto machines, so that they can speak and hear, just like the users with whom they need to interact. Spoken language is attractive because it is the most natural, efficient, flexible, and inexpensive means of communication among humans.

When one thinks about a speech-based interface, two technologies immediately come to mind: speech recognition and speech synthesis. There is no doubt that these are important and as yet unsolved problems in their own right, with a clear set of applications that include document preparation and audio indexing. However, these technologies by themselves are often only a part of the interface solution. Many applications that lend themselves to spoken input/output – inquiring about weather or making travel arrangements – are in fact exercises in information access and/or interactive problem solving. The solution is often built up incrementally, with both the user and the computer playing active roles in the "conversation." Therefore, several language-based input and output technologies must be developed and integrated to reach this goal. The resulting conversational interface¹ is the subject of this paper.

Many speech-based interfaces can be considered conversational, and they may be differentiated by the degree to which the system maintains an active role in the conversation. At one extreme are system-initiative, or directed-dialogue, transactions, where the computer takes complete control of the interaction by requiring that the user answer a set of prescribed questions, much like the touch-tone implementation of interactive voice response (IVR) systems. In the case of air travel planning, for example, a directed-dialogue system could ask the user to "Please say just the departure city." Since the user's options are severely restricted, successful completion of such transactions is easier to attain, and indeed some successful demonstrations and deployments of such systems have been made [5,64]. At the other extreme are user-initiative systems, in which the user has complete freedom in what they say to the system (e.g., "I want to visit my grandmother") while the system remains relatively passive, asking only for clarification when necessary. In this case, the user may feel uncertain as to what capabilities exist, and may, as a consequence, stray quite far from the domain of competence of the system, leading to great frustration because nothing is understood. Lying between these two extremes are systems that incorporate a mixed-initiative, goal-oriented dialogue, in which both the user and the computer participate actively to solve a problem interactively using a conversational paradigm. It is this latter mode of interaction that is the primary focus of this paper.

What is the nature of such mixed-initiative interaction? One way to answer the question is to examine human-human interactions during joint problem solving [30]. Figure 1 shows the transcript of a conversation between an agent (A) and a client (C) over the phone. As illustrated by this example, spontaneous dialogue is replete with disfluencies, interruption, confirmation, clarification, ellipsis, co-reference, and sentence fragments. Some of the utterances cannot be understood properly without knowing the context in which they appear. As we shall see, while present systems cannot handle all these phenomena satisfactorily, some of them are being dealt with in a limited fashion.

Should one build conversational interfaces by mimicking human-human interactions? Opinion in this regard is somewhat divided. Some researchers argue that human-human dialogues can be quite variable, containing frequent interruptions, speech overlaps, incomplete or unclear sentences, incoherent segments, and topic switches. Some of these variabilities may not contribute directly to goal-directed problem solving [102].

The authors are with the Spoken Language Systems Group, MIT Laboratory for Computer Science, Cambridge, MA 02139 USA (URL: www.sls.lcs.mit.edu).

¹ Throughout this paper, we will use the terms conversational interfaces, conversational systems, and spoken dialogue systems interchangeably.

C: Yeah, [umm] I'm looking for the Buford Cinema.          [disfluency]
A: OK, and you want to know what's showing there or ...    [interruption]
C: Yes, please.                                            [confirmation]
A: Are you looking for a particular movie?
C: [umm] What's showing.                                   [clarification]
A: OK, one moment.                                         [back channel]
...
A: They're showing A Troll In Central Park.
C: No.                                                     [inference]
A: Frankenstein.                                           [ellipsis]
C: What time is that on?                                   [co-reference]
A: Seven twenty and nine fifty.
C: OK, and the others?                                     [fragment]
A: Little Giant.
C: No.
A: ...
C: ...
A: That's it.
C: Thank you.
A: Thanks for calling Movies Now.

Fig. 1. Transcript of a conversation between an agent (A) and a client (C) over the phone. Typical conversational phenomena are annotated on the right.

Act           Customer          Agent
              Freq.   Words     Freq.   Words
Acknowledge   47.9    2.3       30.8     3.1
Request       29.5    9.0       15.0    12.3
Confirm       13.1    5.3       11.3     6.4
Inform         5.9    7.9       27.8    12.7
Statement      3.4    6.9       15.0     6.7

TABLE I
Statistics of human-human dialogues in a movie domain [30]. Annotated dialogue acts are sorted by customer usage, and include frequency of occurrence (%) and average word length.

Fig. 2. Histograms of utterance length for agents and clients in tasks of information access over the phone.

For practical reasons, it may be desirable to ask users to modify their behavior and interact with the system in a way that is more structured. However, one may argue that users may feel more comfortable with an interface that possesses some of the characteristics of a human agent. As is the case with many other researchers, we have taken the approach of developing a human-machine interface based on analyses of human-human interactions when solving the same tasks. Regardless of the approach, we believe, as do others, that studying human-human dialogue and comparing it to human-machine dialogue can provide valuable insights [7].

Over the years, many large corpora of human-human dialogues have been collected and analyzed (e.g., [1,2,30]). For example, Table I shows statistics of annotated dialogue acts computed from human-human conversations in a movie information domain [30]. These statistics show that nearly half of the customers' dialogue turns were acknowledgements (e.g., "okay," "alright," "uh-huh").² As another example, consider the histograms of the lengths of the utterances per turn for agents and clients shown in Figure 2 [30]. The statistics were gathered from the transcripts of over 100 hours of conversation, in more than 1000 interactions, between agents and clients over the phone on a variety of information access tasks. Over 80% of the clients' utterances are 12 words or less, with a preponderance of very short utterances. Closer examination of the data reveals that these short utterances are mostly back channel communications, such as "okay," "I see," etc. It is important to note that some of the spontaneous speech phenomena serve useful roles in human-human communication, and thus should conceivably be incorporated into conversational interfaces. For example, initial disfluent speech can serve an attention-getting function, and filled pauses and back channel acknowledgements provide reassurance that the utterance is understood or that one partner of the conversation is still working on the problem.

The last decade has witnessed the emergence of some conversational systems with limited capabilities. Despite our moderate success, the ultimate deployment of such interfaces will require continuing improvement of the core human language technologies (HLTs) and the exploration of many uncharted research territories. The purpose of this paper is to outline some of these new research challenges. To set the stage, we will first introduce the components of a typical conversational system, and outline some of the research issues. We will then provide a thumbnail sketch of the recent landscape, discuss some development issues concerning the creation of these systems, and present some of the ongoing and unmet research challenges in this field. While we will endeavor to cover the entire field, we are unavoidably going to draw heavily from our own experience in developing such systems at MIT over the past ten years (e.g., [31,51,59,89,94,106,113,114]). This is a consequence more of familiarity than of ethnocentricity. Interested readers are referred to the recent proceedings of the Eurospeech Conference, the International Conference on Spoken Language Processing, the International Conference on Acoustics, Speech, and Signal Processing, the International Symposium on Spoken Dialogue, and other relevant publications (e.g., [23]).

² An average dialogue consisted of over 28 turns between the customer and the agent.
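To make the derivation of figures like those in Table I concrete, the short sketch below computes per-act frequency and average word length from a list of (act, utterance) annotations. The annotation format is a hypothetical simplification; real corpora such as the movie-domain transcripts of [30] use much richer markup.

    from collections import defaultdict

    def dialogue_act_stats(turns):
        # turns: list of (dialogue act, utterance) pairs for one speaker role
        counts = defaultdict(int)   # number of turns per act
        words = defaultdict(int)    # total words per act
        for act, utterance in turns:
            counts[act] += 1
            words[act] += len(utterance.split())
        total = sum(counts.values())
        # map each act to (frequency in percent, average words per turn)
        return {act: (100.0 * n / total, words[act] / n)
                for act, n in counts.items()}

    print(dialogue_act_stats([("Acknowledge", "okay"),
                              ("Acknowledge", "uh huh"),
                              ("Request", "what time is that on")]))
    # Acknowledge: roughly (66.7, 1.5); Request: roughly (33.3, 5.0)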

Fig. 3. A generic block diagram for a typical conversational interface. (Components: speech recognition, language understanding, semantic frame, discourse context, dialogue manager, database, language generation, speech synthesis, and graphs & tables.)

II. Underlying Technologies and Research Issues

A. System Architecture

Figure 3 shows the major components of a typical conversational interface. The spoken input is first processed through the speech recognition component. The natural language component, working in concert with the recognizer, produces a meaning representation for the utterance. For information retrieval applications illustrated in this figure, the meaning representation can be used to retrieve the appropriate information in the form of text, tables, and graphics. If the information in the utterance is insufficient or ambiguous, the system may choose to query the user for clarification. If verbal conveyance of the information is desired, then natural language generation and text-to-speech synthesis are utilized to produce the spoken responses. Throughout the process, discourse information is maintained and fed back to the speech recognition and language understanding components, so that sentences can be properly understood in context. Finally, a dialogue component manages the interaction between the user and the computer. The nature of the dialogue can vary significantly depending on whether the system is creating or clarifying a query prior to accessing an information database, or perhaps negotiating with the user in a post-information-retrieval phase to relax or somehow modify some aspects of the initial query.

Figure 3 does not adequately convey the notion that a conversational interface may include input and output modalities other than speech. While speech may be the interface of choice, as is the case with phone-based interactions and hands-busy/eyes-busy settings, there are clearly cases where speech is not a good modality, especially, for example, on the output side when the information contains maps, images, or large tables of information which cannot be easily explained verbally. Human communication is inherently multimodal, employing facial, gestural, and other cues to communicate the underlying linguistic message. Thus, speech interfaces should be complemented by visual and sensory motor channels. The user should be able to choose among many modalities, including gesturing, pointing, writing, and typing on the input side [21,90], and graphics and a talking head on the output side [56], to achieve the task at hand in the most natural and efficient manner.

The development of conversational interfaces offers a set of significant challenges to speech and natural language researchers, and raises several important research issues, some of which will be discussed in the remainder of this section.

B. Spoken Input: From Signal to Meaning

Spoken language understanding involves the transformation of the speech signal into a meaning representation that can be used to interact with the specific application back-end. This is typically accomplished in two steps: the conversion of the signal to a set of words (i.e., speech recognition), and the derivation of the meaning from the word hypotheses (i.e., language understanding). A discourse component is often used to properly interpret the meaning of an utterance in the larger context of the interaction.

B.1 Automatic Speech Recognition

Input to conversational interfaces is often generated extemporaneously – especially by novice users of these systems. Such spontaneous speech typically contains disfluencies (i.e., unfilled and filled pauses such as "umm" and "aah," as well as word fragments). In addition, the input utterances are likely to contain words outside the system's working vocabulary – a consequence of the fact that present-day technology can only support the development of systems within constrained domains. Thus far, some attempts have been made to deal with the problem of disfluency. For example, researchers have improved their systems' recognition performance by introducing explicit acoustic models for the filled pauses [9,108]. Similarly, "trash" models have been used to detect the presence of word fragments or unknown words [38], and procedures have been devised to learn the new words once they have been detected [3]. Suffice it to say, however, that the detection and learning of unknown words continues to be a problem that needs our collective attention. A related topic is utterance- and word-level rejection, in the presence of either out-of-domain queries or unknown words [69].

An issue that is receiving increasing attention from the research community is the recognition of telephone-quality speech. It is not surprising that some of the first conversational systems available to the general public were accessible via telephone (e.g., [5,10,64,98]), in many cases replacing existing IVR systems. Telephone-quality speech is significantly more difficult to recognize than high-quality recordings, both because of the limited bandwidth and because of the noise and distortions introduced in the channel [53]. The acoustic condition deteriorates further for cellular telephones, either analog or digital.

B.2 Natural Language Understanding

Speech recognition systems typically implement linguistic constraints as a statistical language model (i.e., an n-gram) that specifies the probability of a word given its predecessors.
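For concreteness, the sketch below estimates such probabilities for a bigram model from raw counts. It is a minimal, unsmoothed illustration with invented training sentences; deployed recognizers add smoothing for unseen word pairs and work with log probabilities.

    from collections import defaultdict

    class BigramLM:
        # Maximum-likelihood bigram model:
        # P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
        def __init__(self, sentences):
            self.pair = defaultdict(int)      # bigram counts c(w_{i-1}, w_i)
            self.history = defaultdict(int)   # history counts c(w_{i-1})
            for words in sentences:
                for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
                    self.pair[(prev, cur)] += 1
                    self.history[prev] += 1

        def prob(self, prev, cur):
            # zero probability for unseen histories or pairs (no smoothing)
            if not self.history[prev]:
                return 0.0
            return self.pair[(prev, cur)] / self.history[prev]

    lm = BigramLM([["show", "me", "flights"], ["show", "me", "fares"]])
    print(lm.prob("show", "me"))      # 1.0
    print(lm.prob("me", "flights"))   # 0.5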

While these language models have been effective in reducing the search space and improving performance, they do not begin to address the issue of speech understanding. On the other hand, most natural language systems are developed with text input in mind; it is usually assumed that the entire word string is known with certainty. This assumption is clearly false for speech input, where many alternative word hypotheses are competing for the same time span in any sentence hypothesis produced by the recognizer (e.g., "euthanasia" and "youth in Asia"), and some words may be more reliable than others because of varying signal robustness. Furthermore, spoken language is often agrammatical, containing fragments, disfluencies, and partial words. Language understanding systems designed for text input may have to be modified in fundamental ways to accommodate spoken input.

Natural language analysis has traditionally been predominantly syntax-driven – a complete syntactic analysis is performed which attempts to account for all words in an utterance. However, when working with spoken material, researchers quickly came to realize that such an approach [12,28,87] can break down dramatically in the presence of unknown words, novel linguistic constructs, recognition errors, and spontaneous speech events such as false starts. Due to these problems, many researchers have tended to favor more semantic-driven approaches, at least for spoken language tasks in constrained domains. In such approaches, a meaning representation is derived by "spotting" key words and phrases in the utterance [109]. While this approach loses the constraint provided by syntax, and may not be able to adequately interpret complex linguistic constructs, the need to accommodate spontaneous speech input has outweighed these potential shortcomings. At the present time, many systems have abandoned the notion of achieving a complete syntactic analysis of every input sentence, favoring a more robust strategy that can still be used to produce an answer when a full parse is not achieved [43,88,97]. This can be accomplished by identifying parsable phrases and clauses, and providing a separate mechanism for gluing them together to form a complete meaning analysis [88]. Ideally, the parser includes a probabilistic framework with a smooth transition to parsing fragments when full linguistic analysis is not achievable. Examples of systems that incorporate such stochastic modelling techniques can be found in [36,60].

How should the speech recognition component interact with the natural language component in order to obtain the correct meaning representation? One of the most popular strategies is the so-called N-best interface [19], in which the recognizer proposes its best N complete sentence hypotheses one by one, stopping with the first sentence that is successfully analyzed by the natural language component. In this case, the natural language component acts as a filter on whole sentence hypotheses. Alternatively, competing recognition hypotheses can be represented in the form of a word graph [39], which is more compact than an N-best list, thus permitting a deeper search if desired.

In an N-best list, many of the candidate sentences may differ minimally in regions where the acoustic information is not very robust. While confusions such as "an" and "and" are acoustically reasonable, one of them can often be eliminated on linguistic grounds. In fact, many of the top N sentence hypotheses might be eliminated before reaching the end if syntactic and semantic analyses take place early on in the search. One possible solution, therefore, is for the speech recognition and natural language components to be tightly coupled, so that only the acoustically promising hypotheses that are linguistically meaningful are advanced. For example, partial theories can be arranged on a stack, prioritized by score. The most promising partial theories are extended using the natural language component as a predictor of all possible next-word candidates; none of the other word hypotheses are allowed to proceed. Therefore, any theory that completes is guaranteed to parse. Researchers are beginning to find that such a tightly coupled integration strategy can achieve higher performance than an N-best interface, often with a considerably smaller stack size [33,35,61,110]. The future is likely to see increasing use of linguistic analysis at earlier stages in the recognition process.

B.3 Discourse

Human verbal communication is a two-way process involving multiple, active participants. Mutual understanding is achieved through direct and indirect speech acts, turn taking, clarification, and pragmatic considerations. A discourse ability allows a conversational system to understand an utterance in the context of the previous interaction. As such, discourse can be considered to be part of the input processing stage. To communicate effectively, a system must be able to handle phenomena such as deictic reference (e.g., verbal pointing, as in "I'll take the second one") and anaphoric reference (e.g., using pronouns, as in "what's their phone number") to allow users to efficiently refer to items currently in focus. An effective system should also be able to handle ellipsis and fragments so that a user does not have to fully specify each query. For instance, if a user says, "I want to go from Boston to Denver," followed with, "show me only United flights," he/she clearly does not want to see all United flights, but rather just the ones that fly from Boston to Denver. The ability to inherit information from preceding utterances is particularly helpful in the face of recognition errors. The user may have asked a complex question involving several restrictions, and the recognizer may have misunderstood a single word, such as a flight number or an arrival time. If a good context model exists, the user can then utter a very short correction phrase, and the system will be able to replace just the misunderstood word, preventing the user from having to repeat the entire utterance and risk further recognition errors.
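A minimal sketch of this kind of context inheritance appears below, using the Boston–Denver example. The flat attribute dictionaries and the function name are invented stand-ins for the richer discourse representations of an actual system.

    def resolve_in_context(new_frame, history):
        # Inherit attributes from the discourse history, letting anything
        # the user just said take precedence over what was said before.
        resolved = dict(history)
        resolved.update(new_frame)
        return resolved

    history = {"source": "Boston", "destination": "Denver"}
    followup = {"airline": "United"}   # "show me only United flights"
    print(resolve_in_context(followup, history))
    # {'source': 'Boston', 'destination': 'Denver', 'airline': 'United'}

The same mechanism supports short correction phrases: a frame containing only the corrected attribute replaces the single misunderstood value while everything else is inherited.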

C. Output Processing: From Information to Signal

On the output side, a conversational interface must be able to convey the information to the user in natural-sounding sentences. This is typically accomplished in two steps: the information is converted into well-formed sentences, which are then fed through a text-to-speech (TTS) system to generate the verbal responses.

C.1 Natural Language Generation

Spoken language generation serves two important roles. First and foremost, it provides a verbal response to the user's queries, which is essential in applications where visual displays are unavailable. In addition, it can provide feedback to the user in the form of a paraphrase, confirming the system's proper understanding of the input query. Although there has been much research on natural language generation (NLG) dealing with the creation of coherent paragraphs (e.g., [57,77]), the language generation component of a conversational system typically produces the response one sentence at a time, without paragraph-level planning. Research in language generation for conversational systems has not received nearly as much attention as has language understanding, especially in the U.S., perhaps due to the funding priorities set forth by the major government sponsors. In many cases, output sentences are simply word strings, in text or pre-recorded acoustic format, that are invoked when appropriate. In some cases, sentences are generated by concatenating templates after filling slots by applying recursive rules along with appropriate constraints (person, gender, number, etc.) [32]. There has also been some recent work using more corpus-based methods for language generation in order to provide more variation in the surface realization of the utterance [65].

C.2 Speech Synthesis

The conversion of text to speech is the final stage of output generation. TTS systems in the past were primarily rule driven, requiring the system developers to possess extensive acoustic-phonetic and other linguistic knowledge [47]. These systems are typically very intelligible, but suffer greatly in naturalness. In recent years, we have seen the emergence of a new, concatenative approach, brought on by inexpensive computation/storage and the availability of large corpora [8,83]. In this corpus-based approach, units excised from recorded speech are concatenated to form an utterance. The selection of the units is based on a search procedure subject to a predefined distortion measure. The output of these TTS systems is often judged to be more natural than that of the rule-based systems [65].

Currently, in most conversational systems, the language generation and text-to-speech components are not closely coupled; the same text is generated whether it is to be read or spoken. Furthermore, systems typically expect the language generation component to produce a textual surface form of a sentence (throwing away valuable linguistic and prosodic knowledge) and then require the text-to-speech component to produce a linguistic analysis anew. Recently, there has been some work in concept-to-speech generation [58]. Such a close coupling can potentially produce higher-quality output speech than could be achieved with a decoupled system, since it permits finer control of prosody. Whether language generation and speech synthesis components should be tightly integrated, or can remain modular but effectively coupled by augmenting text output with a markup language (e.g., SABLE [96]), remains to be seen. Clearly, however, these two components would benefit from a shared knowledge base.

D. Dialogue Management

The dialogue modelling component of a conversational system manages the interaction between the user and the computer. The technology for building this component is one of the least developed in the HLT repertoire, especially for the mixed-initiative dialogue systems considered in this paper. Although there has been some theoretical work on the structure of human-human dialogue [37], this has not led to effective insights for building human-machine interactive systems. As mentioned previously, there is also considerable debate in the speech and language research communities about whether modelling human-machine interactions after human-human dialogues is necessary or appropriate (e.g., [13,82,102]).

Dialogue modelling means different things to different people. For some, it includes the planning and problem solving aspects of human computer interactions [1]. In the context of this paper, we define dialogue modelling as the preparation, for each turn, of the system's side of the conversation, including verbal, tabular, and graphical responses, as well as any clarification requests.

Dialogue modelling and management serve many roles. In the early stages of the conversation, the role of the dialogue manager might be to gather information from the user, possibly clarifying ambiguous input along the way, so that, for example, a complete query can be produced for the application database. The dialogue manager must be able to resolve ambiguities that arise due to recognition error (e.g., "Did you say Boston or Austin") or incomplete specification (e.g., "On what day would you like to travel"). In later stages of the conversation, after information has been accessed from the database, the dialogue manager might be involved in some negotiation with the user. For example, if there were too many items returned from the database, the system might suggest additional constraints to help narrow down the number of choices. Pragmatically, the system must be able to initiate requests so that the information can be reduced to digestible chunks (e.g., "I found ten flights, do you have a preferred airline or connecting city").

In addition to these two fundamental operations, the dialogue manager must also inform and guide the user by suggesting subsequent sub-goals (e.g., "Would you like me to price your itinerary?"), offer assistance upon request, help relax constraints or provide plausible alternatives when the requested information is not available (e.g., "I don't have sunrise information for Oakland, but in San Francisco ..."), and initiate clarification sub-dialogues for confirmation. In general, the overall goal of the dialogue manager is to take an active role in directing the conversation towards a successful conclusion for the user.
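The sketch below illustrates, under heavily simplified assumptions, how such a manager might choose its next move in a flight domain. The slot names, the ten-item threshold, and the prompts are all invented for illustration.

    def next_action(frame, db_hits):
        # Early stage: gather and clarify missing attributes.
        for slot in ("source", "destination", "date"):
            if slot not in frame:
                if slot == "date":
                    return "ask: On what date would you like to travel?"
                return f"ask: What is your {slot} city?"
        # Query stage: the frame is complete but the database not yet consulted.
        if db_hits is None:
            return "retrieve: query the flight database"
        # Later stage: negotiate to reduce the answer to digestible chunks.
        if len(db_hits) > 10:
            return "negotiate: I found many flights; do you have a preferred airline?"
        if not db_hits:
            return "relax: No flights match; shall I drop one of your constraints?"
        return "present: read back the matching flights"

    print(next_action({"source": "Boston", "destination": "Denver"}, None))
    # ask: On what date would you like to travel?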

The dialogue manager can influence other system components by, for example, dynamically making dialogue-context-dependent adjustments to language models or discourse history. At the highest level, it can help detect the appropriate broad subdomain (e.g., weather, air travel, or urban navigation). Within a particular domain, certain queries could introduce a focus of attention on a subset of the lexicon. For instance, in a dialogue about a trip to France, the initial user utterance, "I'm planning a trip to France," would allow the system to greatly enhance the probabilities on all the French destinations. Finally, whenever the system asks a directed question, the language model probabilities can be altered so as to favor appropriate responses to the question. For example, when the system asks the user to provide a date of travel, the system could temporarily enhance the probabilities of date expressions in the response.

There are many ways dialogue management has been implemented. Many systems use a type of scripting language as a general mechanism to describe dialogue flow (e.g., [15,92,98]). Other systems represent dialogue flow by a graph of dialogue objects or modules (e.g., [5,101]). Another aspect of system implementation is whether or not the active vocabulary or understanding capabilities change depending on the state of the dialogue. Some systems are structured to allow a user to ask any question at any point in the dialogue, so that the entire vocabulary is active at all times. Other systems restrict the vocabulary and/or language which can be accepted at particular points in the dialogue. The trade-off is generally between increased user flexibility (in reacting to a system response or query) and increased system understanding accuracy, due to the constraints on the user input.

III. Recent Progress

In the past decade there has been increasing activity in the area of conversational systems, largely due to government funding in the U.S. and Europe. By the late 1980s the DARPA spoken language systems (SLS) program was initiated in the U.S., while the Esprit SUNDIAL (Speech Understanding and DIALog) program was underway in Europe [71]. The task domains for these two programs were remarkably similar in that both involved database access for travel planning, with the European one including both flight and train schedules, and the American one being restricted to air travel. The European program was a multilingual effort involving four languages (English, French, German, and Italian), whereas the American effort was, understandably, restricted to English. All of the systems focused within a narrowly defined area of expertise, and vocabulary sizes were generally limited to several thousand words. Nowadays, these types of systems can typically run in real-time on standard workstations and PCs with no additional hardware.

Strictly speaking, the DARPA SLS program cannot be considered conversational in that its attention focused entirely on the input side. However, since the technology developed during the program had a significant impact on the speech understanding methods used by conversational systems, it is worth describing in more detail. The program adopted the approach of developing the underlying input technologies within a common domain called Air Travel Information Service, or atis [76]. Atis permits users to query for air travel information, such as flight schedules from one city to another, obtained from a small, static relational database excised from the Official Airline Guide. By requiring that all system developers use the same database, it was possible to compare the performance of various spoken language systems based on their ability to extract the correct information from the database, using a set of prescribed training and test data, and a set of interpretation guidelines. Indeed, common evaluations occurred at regular intervals, and steady performance improvements were observed for all systems. At the end of the program, the best system achieved a word error rate of 2.3% and a sentence error rate of 15.2% [68]. Additionally, the best system achieved an understanding error rate of 5.9% and 8.9% for text and speech input, respectively.³

The European SUNDIAL project differed in several ways from the DARPA SLS program. Whereas the SLS program had regular common evaluations, the SUNDIAL project had none. Unlike the SLS program, however, the SUNDIAL project aimed at building systems that could be publicly deployed. For this reason, the SUNDIAL project designated dialogue modelling and spoken language generation as integral parts of the research program. As a result, this has led to some interesting advances in Europe in dialogue control mechanisms.

Since the end of the SLS and SUNDIAL programs in 1995 and 1993, respectively, there have been other sponsored programs in spoken dialogue systems. In the recently completed ARISE (Automatic Railway Information Systems for Europe) project, which was a part of the LE3 program, participants developed train timetable information systems covering three different languages (Dutch, French, and Italian) [24]. Groups explored alternative dialogue strategies, and investigated different technology issues. Four prototypes underwent substantial testing and evaluation (e.g., [17,50,84]). In the U.S., a new DARPA-funded project called Communicator has begun, which emphasizes dialogue-based interactions incorporating both speech input and output technologies. One of the properties of this program is that participants are using a common system architecture to encourage component sharing across sites [91]. Participants in this program are developing both their own dialogue domains and a common complex travel task (e.g., [29]).

In addition to the research sponsored by these larger programs, there have been many other independent initiatives as well. Although there are far too many to list here, some examples include the Berkeley Restaurant Project (BeRP), which provided restaurant information in the Berkeley, California area [44]. The AT&T AutoRes system allowed users to make rental car reservations over the phone via a toll-free number [55].

³ All the performance results quoted here are for the so-called "evaluable" queries, i.e., those queries that are within domain and for which an appropriate answer is available from the database.

Their "How may I help you?" system provides call routing services and information [36]. The waxholm system provides ferry timetables and tourist information for the Stockholm archipelago [11]. At the University of Rochester, the TRAINS project involved train schedule planning [1].

One of the most noticeable trends in spoken dialogue systems is the increasing number of publicly deployed systems. Such systems include not only research prototypes, but also commercial products which are used on a much wider scale for domains such as call routing, stock quotes, train schedules, and flight reservations (e.g., [4,5,10,64]). Although it can be difficult to compare different systems, it is interesting to observe some of their basic properties. Table II shows some statistics of several different systems which have been deployed and used by real users. Each system is characterized in terms of the domain of operation, language, vocabulary size, and the average number of words per utterance and utterances per dialogue. The systems are listed in increasing order of average number of words per utterance. The first three systems are examples of commercial products and/or have been deployed on a very large scale (i.e., fielding millions of calls): the CSELT train timetable information system [10], the SpeechWorks air travel reservation system [5], and the Philips TABA train timetable information system [98]. The second group of six systems are examples of research prototypes which have been made publicly available on a smaller scale. They include movie information and air travel reservation systems developed at CMU [18,81], the LIMSI train timetable information system [78], weather and air travel information systems developed at MIT [93,114], and the AT&T "How may I help you?" operator assistance system [36]. The final statistics were computed from a set of 66 air travel reservation transactions between customers and agents which were transcribed by SRI [48].

System        Domain                           Language   Vocabulary   Average      Average
                                                          Size         Words/Utt    Utts/Dialogue
CSELT         Train Timetable Info             Italian     760         1.6           6.6
SpeechWorks   Air Travel Reservation           English    1000         1.9          10.6
Philips       Train Timetable Info             German     1850         2.7           7.0
CMU           Movie Information                English     757         3.5           9.2
CMU           Air Travel Reservation           English    2851         3.6          12.0
LIMSI         Train Timetable Info             French     1800         4.4          14.6
MIT           Weather Information              English    1963         5.2           5.6
MIT           Air Travel Reservation           English    1100         5.3          14.1
AT&T          Operator Assistance              English    4000         7.0           3.0
(human)       Air Travel Reservations          English       ?         8.0          27.5

TABLE II
A comparison of several conversational systems that have been deployed and used by real users.

From the table, we can see that most of these systems have vocabulary sizes in the thousands, although for other domains such as stock quotes, the vocabulary size could be considerably larger. It is interesting to observe that the average number of words per utterance tends to increase as one moves from commercial systems, to research prototypes, to human-human dialogues. Naturally, there are many factors which affect these averages, including the basic nature of the application. However, it is likely that systems that employ more system-initiative or directed dialogues (by asking the user to answer specific questions), or that require explicit confirmations, would also tend to have fewer words per utterance on average. It is also apparent that none of the human-machine dialogues were as wordy as those between humans.

IV. Development Issues

Spoken dialogue systems require first and foremost the availability of high-performance human language technology components such as speech recognition and language understanding. However, the development of these systems also demands that we pay close attention to a host of other issues. While many of these issues may have little to do with human language technologies per se, they are nonetheless crucial to successful system development. In this section, we will outline some of these development issues.

A. Working in Real Domains

The objective of developing a conversational interface is to provide a natural way for any user, especially the computer illiterate, to access and manage information. Since humans will ultimately be the consumers of this technology, it is important that the systems be developed with their behaviors and needs in mind. An effective strategy, and one that we subscribe to, involves the development of the underlying technologies within real application domains, rather than relying on artificial scenarios, however realistic they might be. Such a strategy will force us to confront some of the critical research issues that may otherwise elude our attention, such as dialogue modelling, new word detection/learning, confidence scoring, robust recognition of accented speech, and portability across domains and languages.

We also believe that working on real applications has the potential benefit of shortening the interval between technology demonstration and its deployment. Above all, real applications that can help people solve problems will be used by real users, thus providing us with a rich and continuing source of useful data. These data are far more useful than anything we could collect in a laboratory environment.

What constitutes real domains and real users? One may rightfully argue that only commercially deployed systems capable of providing robust and scalable solutions can truly be considered real. A laboratory prototype that is only available to the public via perhaps a single phone line is quite different from a commercially deployed system. Nevertheless, we believe a research strategy that incorporates as much realism as possible early in the system's research and development life cycle is far preferable to one that attempts to develop the underlying technologies in a concocted scenario. At the very least, a research prototype capable of providing real and useful information, made available to a wide range of users, offers a valuable mechanism for collecting data that will benefit the development of both types of systems. We will illustrate this point in the next section with one of the systems we have developed at MIT.

How do we select the applications that are well matched to our present capabilities? The answer may lie in examining human-human data. Figure 4 displays the average number of dialogue turns per transaction for several application domains. The data are obtained from the same transcription of the 100 hours of real human-human interactions described earlier. As the data clearly show, helping a user select a movie or a restaurant is considerably less complex than helping a user look for employment.

Fig. 4. Averaged number of dialogue turns for several application domains.

B. Data Collection

Developing conversational interfaces is a classic chicken-and-egg problem. In order to develop the system capabilities, one needs a large corpus of data for system development, training, and evaluation. In order to collect data that reflect actual usage, one needs a system that users can speak to. Figure 5 illustrates a typical cycle of system development. For a new domain or language, one must first develop some limited natural language capabilities, thus enabling an "experimenter-in-the-loop," or wizard-of-oz, data collection paradigm, in which an experimenter types the spoken sentences to the system, after removing spontaneous speech artifacts. This process has the advantage of eliminating potential recognition errors. The resulting data are then used for the development and training of the speech recognition and natural language components. As these components begin to mature, it becomes feasible to collect more data using the "system-in-the-loop," or wizardless, paradigm, which is both more realistic and more cost effective. Performance evaluation using newly collected data will facilitate system refinement.

Fig. 5. Illustration of data collection procedures. (Cycle: limited NL capabilities → data collection (wizard) → speech recognition → data collection (wizardless) → expanded NL capabilities → performance evaluation → system refinement.)

The means and scale of data collection for system development and evaluation have evolved considerably over the last decade. This is true for both the speech recognition and speech understanding communities, and can be seen in many of the systems in the recent ARISE project [24], and elsewhere. At MIT, for example, the voyager urban navigation system was developed in 1989 by recruiting 100 subjects to come to our laboratory and ask a series of questions to an initial wizard-based system [31]. In contrast, the data collection procedure for the more recent jupiter weather information system consists of deploying a publicly available system and recording the interactions [114]. There are large differences in the number of queries, the number of users, and the range of issues which the data provide. By using a system-in-the-loop form of data collection, system development and evaluation become iterative procedures. If unsupervised methods were used to augment the system's ASR and NLU capabilities, system development could become continuous (e.g., [46]).
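The loop of Figure 5 can be summarized as code. Every method below is a hypothetical placeholder for substantial infrastructure (deployment, transcription, training, scoring), so this is a sketch of the shape of the process rather than an implementation.

    class SystemStub:
        # No-op stand-in so the loop can be exercised end to end.
        def collect_from_real_users(self): return []
        def transcribe(self, logged): return logged
        def retrain(self, labeled): pass
        def evaluate_on_held_out_set(self): return 0.20   # pretend 20% WER
        def release_if_improved(self, wer): pass

    def development_cycle(system, rounds=3):
        for _ in range(rounds):
            logged = system.collect_from_real_users()   # deploy; record real interactions
            labeled = system.transcribe(logged)         # manual or unsupervised labeling
            system.retrain(labeled)                     # lexical, language, acoustic models
            wer = system.evaluate_on_held_out_set()     # longitudinal test sets
            system.release_if_improved(wer)             # refine and redeploy

    development_cycle(SystemStub())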

Figure 6 shows, over a two-year period, the cumulative amount of data collected from real users of the MIT jupiter system and the corresponding word error rates (WER) of our recognizer. Before we made the system accessible through a toll-free number, the WER was about 10% for laboratory-collected data. The WER more than tripled during the first week of data collection. As more data were collected, we were able to build better lexical, language, and acoustic models. As a result, the WER continued to decrease over time. This negative correlation suggests that making the system available to real users is a crucial aspect of system development. If the system can provide real and useful information to users, they will continue to call, thus providing us with a constant supply of useful data. However, in order to get users to actually use the system, it needs to provide "real" information to the user. Otherwise, there is little incentive for people to use the system other than to play around with it, or to solve toy problem scenarios which may or may not reflect problems of interest to real users.

Fig. 6. Comparison of recognition performance and the number of utterances collected from real users over time in the MIT weather domain. Note that the x-axis has a nonlinear time scale, reflecting the times when new versions of the recognizer were released. (Axes: Error Rate (%) and Training Data (×1000) versus month, Apr through the following May.)

C. Evaluation

One of the issues facing developers of spoken dialogue systems is how to evaluate progress, in order to determine if they have created a usable system. Developers must decide what metrics to use to evaluate their systems to ensure that progress is being made. Metrics can include component evaluations, but should also assess the overall performance of the system.

For systems which conduct a transaction, it is possible to tell whether or not a user has completed a task. In these cases, it is also possible to measure accompanying statistics such as the length of time to complete the task, the number of turns, etc. It has been noted, however, that such statistics may not be as important as user satisfaction (e.g., [79]). For example, a spoken dialogue interface may take longer than some alternative, yet users may prefer it due to other factors (less stressful, hands-free, etc.). A better form of evaluation might be a measure of whether users liked the system, whether they called to perform a real task (rather than browsing), and whether they would use it again or recommend it to others. Evaluation frameworks such as paradise [104] attempt to correlate system measurements with user satisfaction, in order to better quantify these effects [105].

Although there have been some recent efforts in evaluating language output technologies (e.g., TTS comparisons [85]), evaluation methods for ASR and NLU have been more common, since they are more amenable to automatic evaluation methods where it is possible to decide what a correct answer is. ASR evaluation has tended to be the most straightforward, although there is a range of phenomena for which it is not obvious how to evaluate (e.g., cross talk, mumbling, partial words). NLU evaluation can also be performed by comparing some form of meaning representation with a reference. In [75], for example, two metrics are measured on an utterance-by-utterance basis which attempt to assess the performance of discourse and dialogue in addition to ASR and NLU. The first measures the average number of attributes introduced per query (a measure of information rate), while the second measures how many turns it took, on average, for an intended attribute to be transmitted successfully to the system (a measure of user frustration).

One problem with NLU evaluation is that there is no common meaning representation among different research sites, so cross-site comparison becomes difficult. In the DARPA SLS program, for example, the participants ultimately could agree only on comparing to an answer coming from a common database. Unfortunately, this necessarily led to the creation of a large document defining principles of interpretation for all conceivable queries [41]. In order to keep the responses across systems consistent, systems were restricted from taking the initiative, which is a major constraint on dialogue research.

One way to show progress for a particular system is to perform longitudinal evaluations for recognition and understanding. In the case of jupiter, as shown in Figure 6, we continually evaluate on standard test sets, which we can redefine periodically in order to keep from tuning to a particular data set [74,114]. Since data continually arrive, it is not difficult to create new sets and re-evaluate older system releases on these new data.

Some systems make use of dialogue context to provide constraints for recognition, for example, favoring candidate hypotheses that mention a date after the system has just asked for a date. Thus, any reprocessing of utterances in order to assess improvements in recognition or understanding performance at a later time needs to be able to take advantage of the same dialogue context as was present in the original dialogue with the user. To do this, the dialogue context must be recorded at the time of data collection, and re-utilized in the subsequent off-line processing, in order to avoid giving the original system an unwarranted advantage [75].
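A rough sketch of how the two utterance-level measures of [75] described above might be computed is shown below. The (intended, understood) attribute-set encoding is our own simplification of the bookkeeping defined in the cited work.

    def dialogue_metrics(turns):
        # turns: one (intended, understood) pair of attribute sets per user query
        first_try = {}   # attribute -> turn where the user first mentioned it
        landed = {}      # attribute -> turn where the system finally received it
        for i, (intended, understood) in enumerate(turns):
            for a in intended:
                first_try.setdefault(a, i)
            for a in understood:
                landed.setdefault(a, i)
        rate = len(first_try) / len(turns)   # new attributes introduced per query
        delays = [landed[a] - first_try[a] + 1 for a in landed]
        frustration = sum(delays) / len(delays) if delays else 0.0
        return rate, frustration

    # "Boston to Denver" with the source misrecognized, then corrected:
    turns = [({"source", "destination"}, {"destination"}),
             ({"source"}, {"source"})]
    print(dialogue_metrics(turns))   # (1.0, 1.5)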

recognition community for other applications such as person wants to know how to go from the train station to
speech dictation and spoken document retrieval, although a restaurant whose name is unknown to the system, they
the recognizer is often exercised in different ways. For ex- will not settle for a response such as, “I am sorry I don’t
ample, in contrast to desktop dictation systems, the speech understand you. Please rephrase the question.” The sys-
recognition component in a conversational system is often tem must be able not only to detect new words, taking into
required to handle a wide range of channel variations. In- account acoustic, phonological, and linguistic evidence, but
creasingly, landline and cellular phones are the transducer also to adaptively acquire them, both in terms of their or-
of choice, thus requiring the system to deal with narrow thography and linguistic properties. In some cases, fun-
channel bandwidths, low signal-to-noise ratios, diversity in damental changes in the problem formulation and search
handset characteristics, drop-out, and other artifacts. In strategy may be necessary. While some research is being
many situations where speech input is especially appropri- conducted in this area [3,38], much more work remains to
ate (e.g., hands-busy/eyes-busy) there can be significant be done.
background noise (e.g., cars), and possibly stress on the For simple applications such as auto-attendant, it is pos-
part of the speaker. Robust conversational systems will be sible for a conversational system to achieve “understand-
required to handle these types of phenomena. ing” without utilizing sophisticated natural language pro-
Another problem that is particularly acute for conversa- cessing techniques. For example, one could perform key-
tional systems is the recognition of speech from a diverse word or phrase spotting on the recognizer’s output to ob-
speaker population. In the data we collected for jupiter, tain a meaning representation. As the interactions become
for example, we observed a significant number of children, more complex, involving multiple turns, the system may
as well as users with strong dialects and non-native accents. need more advanced natural language analysis in order to
The challenge posed by these data to speaker-independent achieve understanding in context.
recognition technology must be met [54], since conversa- Although there are many examples in the literature of
tional interfaces are intended to serve people from all walks both partial and fully unsupervised learning methods ap-
of life. plied to natural language processing, NLU in the context
A solution to these channel and speaker variability prob- of conversational systems remains mainly a knowledge in-
lems may be adaptation. For applications in which the en- tensive process. Even stochastic approaches that can learn
tire interaction consists of only a few queries, short-term the linguistic regularities automatically require that a large
adaptation using only a small amount of data would be nec- corpus be properly annotated with syntactic and semantic
essary. For applications where the user identity is known, tags [36,60,70]. One of the continuing challenges facing re-
the system can make use of user profiles to adapt not searchers is the discovery of processes that can automate
only acoustic-phonetic characteristics, but also pronunci- the discovery of linguistic facts.
ation, vocabulary, language, and possibly domain prefer- Competing strategies to achieve robust understanding
ences (e.g., user lives in Boston, prefers aisle seat when have been explored in the research community. For exam-
flying). ple, the system could adopt the strategy of first performing
An important problem for conversational systems is the word- and phrase-spotting, and rely on full linguistic anal-
detection and learning of new words. In a domain such as ysis only when necessary. Alternatively, the system could
jupiter or electronic Yellow Pages, a significant fraction first perform full linguistic analysis in order to uncover the
of the words uttered by users may not be in the system’s linguistic structure of the utterance, and relax the con-
working vocabulary. This is unavoidable partly because it straints through robust parsing and word/phrase-spotting
is not possible to anticipate all the words that all users are only when full linguistic analysis fails. At this point, it
likely to use, and partly because the database is usually is not clear which of these strategies would yield the best
changing with time (e.g., new restaurants opening up). In performance. Continued investigation is clearly necessary.
systems such as jupiter, users will sometimes try to help
In the past, we have not paid much attention to the unknown word problem because the tasks the speech recognition community has chosen often assume a closed vocabulary. In the limited cases where the vocabulary has been open, unknown words have accounted for a small fraction of the word tokens in the test corpus. Thus researchers could either construct generic “trash word” models and hope for the best, or ignore the unknown word problem altogether and accept a small penalty on word error rate. In real applications, however, the system must be able to cope with unknown words simply because they will always be present, and ignoring them will not satisfy the user's needs – if a …
… processing techniques. For example, one could perform keyword or phrase spotting on the recognizer's output to obtain a meaning representation. As the interactions become more complex, involving multiple turns, the system may need more advanced natural language analysis in order to achieve understanding in context.
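A toy illustration of this kind of keyword and phrase spotting is sketched below; the patterns and the flat attribute-value frame are our invention for a hypothetical weather domain, not any fielded system's grammar:

    # Toy keyword/phrase spotting over a recognizer hypothesis, yielding a
    # flat attribute-value meaning representation. Patterns are invented.
    import re

    PATTERNS = {
        "topic": r"\b(weather|forecast|temperature)\b",
        "city":  r"\bin ([a-z]+(?: [a-z]+)?)\b",
        "day":   r"\b(today|tomorrow|monday|friday)\b",
    }

    def spot(hypothesis: str) -> dict:
        frame = {}
        for slot, pattern in PATTERNS.items():
            match = re.search(pattern, hypothesis.lower())
            if match:
                frame[slot] = match.group(1)
        return frame  # keep whatever was found; ignore everything else

    print(spot("uh what is the weather in san francisco tomorrow please"))
    # {'topic': 'weather', 'city': 'san francisco', 'day': 'tomorrow'}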
Although there are many examples in the literature of both partial and fully unsupervised learning methods applied to natural language processing, NLU in the context of conversational systems remains mainly a knowledge intensive process. Even stochastic approaches that can learn the linguistic regularities automatically require that a large corpus be properly annotated with syntactic and semantic tags [36,60,70]. One of the continuing challenges facing researchers is the development of processes that can automate the discovery of linguistic facts.
Competing strategies to achieve robust understanding have been explored in the research community. For example, the system could adopt the strategy of first performing word- and phrase-spotting, and rely on full linguistic analysis only when necessary. Alternatively, the system could first perform full linguistic analysis in order to uncover the linguistic structure of the utterance, and relax the constraints through robust parsing and word/phrase-spotting only when full linguistic analysis fails. At this point, it is not clear which of these strategies would yield the best performance. Continued investigation is clearly necessary.
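The second of these strategies can be summarized in a few lines; in the sketch below the three analyzer callbacks are placeholders standing in for a full parser, a robust (fragment) parser, and a phrase spotter, as one might build, for example, with a parser such as TINA [87,88]:

    # Schematic fallback cascade: full parse, then robust parse, then
    # phrase spotting. The analyzer callbacks are placeholders.
    def understand(hypothesis, full_parse, robust_parse, spot):
        for analyze in (full_parse, robust_parse, spot):
            frame = analyze(hypothesis)
            if frame:            # accept the first analysis that succeeds
                return frame
        return None              # nothing understood: ask for clarification

    # Example wiring with trivial stand-ins:
    frame = understand("what cities do you know",
                       full_parse=lambda h: None,          # full parse fails
                       robust_parse=lambda h: None,        # fragments fail too
                       spot=lambda h: {"topic": "help"})   # spotting succeeds
    print(frame)  # {'topic': 'help'}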
B. Spoken Language Generation

With few exceptions, current research in spoken language systems has focused on the input side, i.e., the understanding of the input queries, rather than the conveyance of the information. It is interesting to observe, however, that the speech synthesis component is the one that often leaves the most lasting impression on users – especially when it does not sound particularly natural. As such, more natural sounding speech synthesis will be an important research topic for spoken dialogue systems in the future.

Spoken language generation is an extremely important aspect of the human-computer interface problem, especially if the transactions are to be conducted over a telephone. Models and methods must be developed that will generate natural sentences appropriate for spoken output, across many domains and languages. For applications where all information must be conveyed aurally, particular attention must be paid to the interaction between language generation and dialogue management – the system may have to initiate a clarification sub-dialogue to reduce the amount of information returned from the back-end, in order not to generate unwieldy verbal responses.
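As a minimal sketch of such a policy (the threshold and wording below are invented), the system might switch from enumeration to a narrowing question whenever the back-end returns more items than can comfortably be spoken:

    # Invented policy sketch: summarize and narrow rather than read a long
    # list aloud over the telephone.
    def verbal_response(flights, max_spoken=3):
        if len(flights) > max_spoken:
            return (f"I found {len(flights)} flights. "
                    "Do you have a preferred departure time?")
        return " ".join(f"Option {i + 1}: {f}." for i, f in enumerate(flights))

    print(verbal_response(["United 100 at 9 a.m.", "American 200 at noon"]))
    print(verbal_response([f"flight {n}" for n in range(12)]))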
As mentioned earlier, recent work in speech synthesis based on non-uniform units has resulted in much improved synthetic speech quality [42,83]. However, we must continue to improve speech synthesis capabilities, particularly with regard to the encoding of prosodic and possibly paralinguistic information such as emotion. As is the case on the input side, we must also explore integration strategies for language generation and speech synthesis. Finally, evaluation methodologies for spoken language generation technology must continue to be developed, and more comparative evaluations performed [85].
Many researchers have observed that the precise wording of the system response can have a large impact on the user response. In general, the more vaguely worded the response, the larger the variation in the inputs it elicits [5,78]. Which type of response is more desirable will perhaps depend on whether the system is used for research or commercial purposes. If the final objective is to improve understanding of a wider variety of input, then a more general response might be more appropriate. A more directed response, however, would most likely improve performance in the short term.
The language generation used by most spoken dialogue systems tends to be static, using the identical response pattern in its interaction with users. While it is quite possible that users will prefer consistent feedback from a system, we have observed that introducing variation in the way we prompt users for additional queries (e.g., “Is there anything else you'd like to know?”, “Can I help you with anything else?”, “What else?”) is quite effective in making the system appear less robotic and more natural to users. It would be interesting to see if a more stochastic language generation capability would be well received by users. In addition, the ability to vary the prosody of the output (e.g., apply contrastive stress to certain words) also becomes important in reducing the monotony and unnaturalness of speech responses.
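The simplest version of this variation is nothing more than random selection among hand-written prompts, as in the sketch below; a genuinely stochastic generator, in the sense of [65], would go further and vary the wording itself:

    # Minimal non-deterministic prompt selection; one small step toward
    # stochastic generation (cf. [65]). The prompts are the examples above.
    import random

    REPROMPTS = [
        "Is there anything else you'd like to know?",
        "Can I help you with anything else?",
        "What else?",
    ]

    def reprompt() -> str:
        return random.choice(REPROMPTS)

    print(reprompt())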
A more philosophical question for language generation is whether or not to personify the system in its responses to users. Naturally, there are varied opinions on this matter. In many situations we have found that an effective response is one commonly used in human-human interaction (e.g., “I'm sorry”). Users do not seem to be bothered by the personification evident in our deployed systems.
Although prosody impacts both speech understanding and speech generation, prosodic features have been most widely incorporated into text-to-speech systems. However, there have been attempts to make use of prosodic information for both recognition and understanding [40,66,86], and it is hoped that more research will appear in this area in the future. In the Verbmobil project, researchers have been able to show considerable improvement in processing speed when integrating prosodic information into the search component during recognition [63].
C. Dialogue Management

In most current dialogue systems, the design of the dialogue strategy is typically hand-crafted by the system developers, and as such is largely based on their intuition about the proper dialogue flow. This can be a time-consuming process, especially for mixed-initiative dialogues, and the result may not generalize to different domains. There has been some recent research exploring the use of machine learning techniques to automatically determine dialogue strategy [52].
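The work in [52] casts strategy selection as a Markov decision process; the fragment below is only a schematic of that idea (tabular Q-learning over invented dialogue states and system actions), not a description of their implementation:

    # Schematic tabular Q-learning over (dialogue state, system action)
    # pairs, in the spirit of [52]. States, actions, and rewards are
    # invented placeholders.
    import random
    from collections import defaultdict

    ACTIONS = ["ask_open", "ask_directed", "confirm", "present_results"]
    Q = defaultdict(float)                    # Q[(state, action)] -> value

    def choose_action(state, epsilon=0.1):
        if random.random() < epsilon:         # occasionally explore
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next
                                       - Q[(state, action)])

    # One simulated turn: an open prompt in state "s0" led to a failure.
    update("s0", "ask_open", reward=-1.0, next_state="s1")
    print(choose_action("s0"))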
Regardless of the approach, however, there is a need to develop the necessary infrastructure for dialogue research. This includes the collection of dialogue data, both human-human and human-machine. These data will need to be annotated, after developing annotation tools and establishing proper annotation conventions. In the last decade, the speech recognition and language understanding communities have benefitted from the availability of large, annotated corpora. Similar efforts are desperately needed for dialogue modelling. Organizations such as the Special Interest Group on Dialogue (SIGdial) of the Association for Computational Linguistics aim to advance dialogue research in areas such as portability, evaluation, resource sharing, and standards, among others [95].
Since we are far from being able to develop omnipotent systems capable of unrestricted dialogue, it is necessary for current systems to accurately convey their limited capabilities to the user, including both the domain of knowledge of the system itself, and the kind of speech queries that the system can understand. While expert users can eventually become familiar with at least a subset of the system capabilities, novices can have considerable difficulty if their expectations are not well-matched with the system capabilities. This issue is particularly relevant for mixed-initiative dialogue systems; by providing more flexibility and freedom to users to interact with the system, one could potentially increase the danger of them straying out of the system's domain of expertise. For example, our jupiter system knows only short-term weather forecasts, yet users ask a wide variety of legitimate weather questions (e.g., “What's the average rainfall in Guatemala in January?” or “When is high tide tomorrow?”) which are outside the system's capabilities, along with a wide variety of non-weather queries. Even if users are aware of the system's domain of knowledge, they may not know the range of knowledge within the domain. For example, jupiter does not know all 23,000 cities in the United States, so it is necessary to be able to detect when a user is asking for an out-of-vocabulary city, and then inform the user what cities the system knows without listing all possibilities. Finally, even if the user knows the full range of capabilities of the system, they may not know what type of questions the system is able to understand.
In order to assist users to stay within the capabilities of the system, some form of “help” capability is required. However, it is difficult to provide help since users may not know when to ask for it, and when they do, the help request may not be explicit, especially if they do not understand why the system was misbehaving in the first place. Regardless, the help messages will clearly need to be context dependent, with the system offering appropriate suggestions depending on the dialogue state.
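A trivial rendering of context-dependent help is a table keyed on the dialogue state, as in the sketch below; the states and the message wording are invented for illustration only:

    # Toy context-dependent help table; states and wording are invented.
    HELP_MESSAGES = {
        "start":        "You can ask me for a weather forecast, for example: "
                        "what is the forecast for Boston tomorrow?",
        "city_unknown": "I may not know that city. Try a larger city nearby, "
                        "or spell the name for me.",
        "date_needed":  "Please tell me a day, for example: tomorrow, or Friday.",
    }

    def help_message(dialogue_state: str) -> str:
        return HELP_MESSAGES.get(dialogue_state, HELP_MESSAGES["start"])

    print(help_message("city_unknown"))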
Another challenging area of research is the recovery from the inevitable misunderstandings that a system will make. Errors could be due to many different phenomena (e.g., acoustics, speaking style, disfluencies, out-of-vocabulary words, parse coverage, or understanding gaps), and it can be difficult to detect that there is a problem, to determine what caused it, and to convey to the user an appropriate response that will fix it.
Many systems incorporate some form of confidence scoring to try to identify problematic inputs (e.g., [5,45]). The system can then either try an alternative strategy to help the user, or back off to a more directed dialogue and/or one that requires explicit confirmation [78,81,99]. Based on our statistics with jupiter, however, we have found that, when an utterance is rejected, it is highly likely that the next utterance will be rejected as well [69]. Thus, it appears that certain users have an unfortunate tendency to go into a rejection death spiral which can be hard to get out of! Using confidence scoring to perform partial understanding might allow for more refined corrective dialogue (e.g., requesting input of only the uncertain regions). Partial understanding may also help in identifying out-of-vocabulary words, and enable more constructive feedback from the system about the possible courses of action (e.g., “I heard you ask for the weather in a city in New Jersey. Can you spell it for me?”).
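One way to combine these observations into a single policy is sketched below; the thresholds and action names are invented, and real systems weigh many more factors [5,45,78,81,99]:

    # Invented back-off policy driven by utterance-level confidence and a
    # count of recent rejections (to escape the "rejection death spiral").
    def next_move(confidence, recent_rejections,
                  accept_threshold=0.8, reject_threshold=0.4):
        if confidence >= accept_threshold:
            return "act_on_query"
        if confidence >= reject_threshold:
            return "confirm_explicitly"           # "Did you say Boston?"
        if recent_rejections >= 2:
            return "switch_to_directed_dialogue"  # "Please say just the city."
        return "reject_and_reprompt"

    print(next_move(0.35, recent_rejections=2))   # switch_to_directed_dialogue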
Spoken dialogue systems can behave quite differently depending on what input and output modalities are available to the user. In displayless environments such as the telephone, it might be necessary to tailor the dialogue so as not to overwhelm the user with information. When displays are available, however, it may be more desirable to simply summarize the information to the user, and to show them a table or image, etc. Similarly, the nature of the interaction will change if alternative input modalities, such as pen or gesture, are available to the user. Which modality is most effective will depend, among other things, on environment (e.g., classroom), user preference, and perhaps dialogue state [67].
Researchers are also beginning to study the addition of back-channel communication in spoken dialogue responses, in order to make the interaction more natural. Prosodic information from fundamental frequency and duration appears to provide important clues as to when back-channelling might occur [62,111]. Intermediate feedback from the system can also be more informative to the user than silence or idle music when inevitable delays occur in the dialogue (e.g., “Hold on while I look for the cheapest price for your flight to London...”).
Finally, many systems are able to handle interruptions by allowing the user to “barge in” over the system response (e.g., [5,64,81]). To date, barge-in has been treated primarily as an acoustic problem, with perhaps some interaction with a speech recognizer. However, it clearly should also be viewed as an understanding problem, so that the system can differentiate among different types of input such as noise, back-channel, or a significant question or statement, and take appropriate actions. In addition, it will be necessary to properly update the dialogue status to reflect the fact that barge-in occurred. For example, if the system was reading a list of flights, the system might need to remember where the interruption occurred – especially if the interruption was under-specified (e.g., “I'll take the United flight,” or “Tell me about that one”).
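The bookkeeping involved can be illustrated in a few lines; the class below is invented for this discussion and simply remembers which list item was being read when the interruption occurred, so that an under-specified follow-up can be resolved:

    # Invented sketch of dialogue-state bookkeeping for barge-in over a list.
    class ListReader:
        def __init__(self, items):
            self.items = items
            self.position = 0                    # index of the next item to read

        def next_item(self):
            item = self.items[self.position]
            self.position += 1
            return item

        def item_at_barge_in(self):
            """The item being read when the user interrupted."""
            return self.items[max(self.position - 1, 0)]

    reader = ListReader(["United 100", "American 200", "Delta 300"])
    reader.next_item(); reader.next_item()       # system reads two flights
    print(reader.item_at_barge_in())             # "American 200", i.e., "that one"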
D. Portability

Creating a robust, mixed-initiative dialogue system can require a tremendous amount of effort on the part of researchers. In order for this technology to ultimately be successful, the process of porting existing technology to new domains and languages must be made easier. Over time, researchers have made the technology more modular. Over the past few years, different research groups have been attempting to make it easier for non-experts to create new domains. Systems which modularize their dialogue manager try to take advantage of the fact that a dialogue can often be broken down into a set of smaller sub-dialogues (e.g., dates, addresses), in order to make it easier to construct dialogue for a new domain (e.g., [5,101]). For example, researchers at OGI have developed rapid development kits for creating spoken dialogue systems, which are freely available [101], and which have been used by students to create their own systems [100]. On the commercial side there has been a significant effort to develop the Voice eXtensible Markup Language (VoiceXML) as a standard to enable internet content and information access via voice and phone [103]. To date these approaches have been applied only to directed dialogue strategies. Much more research is needed in this area if systems with complex dialogue strategies are to generalize to different domains.
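At their core, the directed strategies these tools support reduce to a form-filling loop like the sketch below; the slot names and callbacks are invented, and in practice reusable sub-dialogue modules (for dates, addresses, and the like) would be plugged in per slot:

    # Minimal directed (system-initiative) form-filling loop of the kind
    # targeted by the rapid-development kits and VoiceXML. All names are
    # invented for illustration.
    def fill_form(slots, ask, recognize):
        """slots maps slot name -> value or None; ask/recognize are callbacks."""
        for name in slots:
            while slots[name] is None:
                slots[name] = recognize(ask(f"Please say the {name}."))
        return slots

    # Example wiring with trivial callbacks:
    answers = iter(["boston", "denver", "tomorrow"])
    form = fill_form({"departure city": None, "arrival city": None, "date": None},
                     ask=lambda prompt: prompt,
                     recognize=lambda prompt: next(answers))
    print(form)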
Currently, the development of speech recognition and language understanding technologies has been domain and language specific, requiring a large amount of annotated training data. However, it may be costly, or even impossible, to collect a large amount of training data for certain applications or languages. Therefore, we must address the problems of producing a conversational system in a new domain and language given at most a small amount of domain-specific training data. To achieve this goal, we must strive to cleanly separate the algorithmic aspects of the system from the application-specific aspects. We must also develop automatic or semi-automatic methods for acquiring the acoustic models, language models, grammars, semantic structures for language understanding, and dialogue models required by a new application. The issue of portability spans across different acoustic environments, databases, knowledge domains, and languages. Real deployment of multilingual spoken language technology cannot take place without adequately addressing this issue.

VI. Concluding Remarks

In this paper, we have attempted to outline some of the important research challenges that must be addressed before spoken language technologies can be put to pervasive use. The timing for the development of human language technology is particularly opportune, since the world is mobilizing to develop the information highway that will be the backbone of future economic growth. Human language technology will play a central role in providing an interface that will dramatically change the human-machine communication paradigm from programming to conversation. It will enable users to efficiently access, process, manipulate, and absorb a vast amount of information. While much work needs to be done, the progress made collectively by the community thus far gives us every reason to be optimistic about fielding such systems, albeit with limited capabilities, in the near future.
VII. Acknowledgements

The authors would like to thank Paulo Baggia, Andrew Halberstadt, Lori Lamel, Giuseppe Riccardi, Alex Rudnicky, and Bernd Souvignier for providing the statistics about their spoken dialogue systems used in this paper. The authors would also like to thank Joseph Polifroni, Stephanie Seneff, and the two anonymous reviewers who made many helpful suggestions to improve the overall quality of the paper. This research was supported by DARPA under contract N66001-99-1-8904, monitored through Naval Command, Control and Ocean Surveillance Center.
References

[1] J. Allen et al., “The TRAINS project: A Case Study in Defining a Conversational Planning Agent,” J. Experimental and Theoretical AI, 7, 7–48, 1995.
[2] A. Anderson et al., “The HCRC Map Task Corpus,” Language and Speech, 34(4), 351–366, 1992.
[3] A. Asadi, R. Schwartz, and J. Makhoul, “Automatic Modelling for Adding New Words to a Large Vocabulary Continuous Speech Recognition System,” Proc. ICASSP, 305–308, 1991.
[4] H. Aust, M. Oerder, F. Seide, and V. Steinbiss, “The Philips Automatic Train Timetable Information System,” Speech Communication, 17, 249–262, 1995.
[5] E. Barnard, A. Halberstadt, C. Kotelly, and M. Phillips, “A Consistent Approach to Designing Spoken-Dialog Systems,” Proc. ASRU Workshop, Keystone, CO, 1999.
[6] M. Bates, R. Bobrow, P. Fung, R. Ingria, F. Kubala, J. Makhoul, L. Nguyen, R. Schwartz, and D. Stallard, “The BBN/HARC Spoken Language Understanding System,” Proc. ICASSP, 111–114, 1993.
[7] N. Bernsen, L. Dybkjaer, and H. Dybkjaer, “Cooperativity in Human-Machine and Human-Human Spoken Dialogue,” Discourse Processes, 21(2), 213–236, 1996.
[8] M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal, “The AT&T Next-Gen TTS System,” Proc. ASA, Berlin, 1999.
[9] J. Butzberger, H. Murveit, and M. Weintraub, “Spontaneous Speech Effects in Large Vocabulary Speech Recognition Applications,” Proc. ARPA Workshop on Speech and Natural Language, 339–344, 1992.
[10] R. Billi, R. Canavesio, and C. Rullent, “Automation of Telecom Italia Directory Assistance Service: Field Trial Results,” Proc. IVTTA, 11–16, 1998.
[11] M. Blomberg, R. Carlson, K. Elenius, B. Granstrom, J. Gustafson, S. Hunnicutt, R. Lindell, and L. Neovius, “An Experimental Dialogue System: Waxholm,” Proc. Eurospeech, 1867–1870, 1993.
[12] R. Bobrow, R. Ingria, and D. Stallard, “Syntactic and Semantic Knowledge in the DELPHI Unification Grammar,” Proc. DARPA Speech and Natural Language Workshop, 230–236, 1990.
[13] L. Boves and E. den Os, “Applications of Speech Technology: Designing for Usability,” Proc. IEEE Workshop on ASR and Understanding, 353–362, 1999.
[14] B. Buntschuh et al., “VPQ: A Spoken Language Interface to Large Scale Directory Information,” Proc. ICSLP, 2863–2866, 1998.
[15] R. Carlson and S. Hunnicutt, “Generic and Domain-Specific Aspects of the Waxholm NLP and Dialogue Modules,” Proc. ICSLP, 677–680, 1996.
[16] J. Cassell, “Embodied Conversation: Integrating Face and Gesture into Automatic Spoken Dialogue Systems,” to appear, Spoken Dialogue Systems, Luperfoy (ed.), MIT Press.
[17] G. Castagnieri, P. Baggia, and M. Danieli, “Field Trials of the Italian ARISE Train Timetable System,” Proc. IVTTA, 97–102, 1998.
[18] CMU Movieline, http://www.speech.cs.cmu.edu/Movieline
[19] Y. Chow and R. Schwartz, “The N-Best Algorithm: An Efficient Procedure for Finding Top N Sentence Hypotheses,” Proc. ARPA Workshop on Speech and Natural Language, 199–202, 1989.
[20] M. Cohen, Z. Rivlin, and H. Bratt, “Speech Recognition in the ATIS Domain using Multiple Knowledge Sources,” Proc. DARPA Spoken Language Systems Technology Workshop, 257–260, 1995.
[21] P. Cohen, M. Johnson, D. McGee, S. Oviatt, J. Clow, and I. Smith, “The Efficiency of Multimodal Interaction: A Case Study,” Proc. ICSLP, 249–252, 1998.
[22] P. Constantinides, S. Hansma, and A. Rudnicky, “A Schema-based Approach to Dialog Control,” Proc. ICSLP, 409–412, 1998.
[23] P. Dalsgaard, L. Larsen, and I. Thomsen (Eds.), Proc. ESCA Tutorial and Research Workshop on Spoken Dialogue Systems: Theory and Application, Vigsø, Denmark, 1995.
[24] E. den Os, L. Boves, L. Lamel, and P. Baggia, “Overview of the ARISE Project,” Proc. Eurospeech, 1527–1530, 1999.
[25] M. Denecke and A. Waibel, “Dialogue Strategies Guiding Users to Their Communicative Goals,” Proc. Eurospeech, 2227–2230, 1997.
[26] L. Devillers and H. Bonneau-Maynard, “Evaluation of Dialog Strategies for a Tourist Information Retrieval System,” Proc. ICSLP, 1187–1190, 1998.
[27] V. Digilakis et al., “Rapid Speech Recognizer Adaptation to New Speakers,” Proc. ICASSP, 765–768, 1999.
[28] J. Dowding, J. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran, “Gemini: A Natural Language System for Spoken Language Understanding,” Proc. ARPA Workshop on Human Language Technology, 21–24, 1993.
[29] M. Eskenazi, A. Rudnicky, K. Gregory, P. Constantinides, R. Brennan, C. Bennett, and J. Allen, “Data Collection and Processing in the Carnegie Mellon Communicator,” Proc. Eurospeech, 2695–2698, 1999.
[30] G. Flammia, “Discourse Segmentation of Spoken Dialogue: An Empirical Approach,” Ph.D. Thesis, MIT, 1998.
[31] J. Glass, G. Flammia, D. Goodine, M. Phillips, J. Polifroni, S. Sakai, S. Seneff, and V. Zue, “Multilingual Spoken-Language Understanding in the MIT Voyager System,” Speech Communication, 17, 1–18, 1995.
[32] J. Glass, J. Polifroni, and S. Seneff, “Multilingual Language Generation across Multiple Domains,” Proc. ICSLP, 983–976, 1994.
[33] D. Goddeau, “Using Probabilistic Shift-Reduce Parsing in Speech Recognition Systems,” Proc. ICSLP, 321–324, 1992.
[34] D. Goddeau, H. Meng, J. Polifroni, S. Seneff, and S. Busayapongchai, “A Form-Based Dialogue Manager for Spoken Language Applications,” Proc. ICSLP, 701–704, 1996.
[35] D. Goodine, S. Seneff, L. Hirschman, and M. Phillips, “Full Integration of Speech and Language Understanding in the MIT Spoken Language System,” Proc. Eurospeech, 845–848, 1991.
[36] A. Gorin, G. Riccardi, and J. Wright, “How may I help you?,” Speech Communication, 23, 113–127, 1997.
[37] B. Grosz and C. Sidner, “Plans for Discourse,” in Intentions in Communication, MIT Press, 1990.
[38] L. Hetherington and V. Zue, “New words: Implications for Continuous Speech Recognition,” Proc. Eurospeech, 475–931, 1991.
[39] L. Hetherington, M. Phillips, J. Glass, and V. Zue, “A∗ Word Network Search for Continuous Speech Recognition,” Proc. Eurospeech, 1533–1536, 1993.
[40] J. Hirschberg, “Communication and Prosody: Functional Aspects of Prosody,” Proc. ESCA Workshop on Dialogue and Prosody, 7–15, 1999.
[41] L. Hirschman et al., “Multi-site Data Collection for a Spoken Language Corpus,” Proc. DARPA Workshop on Speech and Natural Language, 7–14, 1992.
[42] X. Huang, A. Acero, J. Adcock, H.W. Hon, J. Goldsmith, J. Liu, and M. Plumpe, “WHISTLER: A Trainable Text-to-Speech System,” Proc. ICSLP, 2387–2390, 1996.
[43] E. Jackson, D. Appelt, J. Bear, R. Moore, and A. Podlozny, “A Template Matcher for Robust NL Interpretation,” Proc. DARPA Speech and Natural Language Workshop, 190–194, 1991.
[44] D. Jurafsky, C. Wooters, G. Tajchman, J. Segal, A. Stolcke, E. Fosler, and N. Morgan, “The Berkeley Restaurant Project,” Proc. ICSLP, 2139–2142, 1994.
[45] A. Kellner, B. Rueber, and H. Schramm, “Using Combined Decisions and Confidence Measures for Name Recognition in Automatic Directory Assistance Systems,” Proc. ICSLP, 2859–2862, 1998.
[46] T. Kemp and A. Waibel, “Unsupervised Training of a Speech Recognizer: Recent Experiments,” Proc. Eurospeech, 2725–2728, 1999.
[47] D. Klatt, “Review of text-to-speech conversion for English,” J. Acoust. Soc. Am., 82(3), 737–793, 1987.
[48] J. Kowtko and P. Price, “Data Collection and Analysis in the Air Travel Planning Domain,” Proc. DARPA Speech and Natural Language Workshop, October 1989.
[49] L. Lamel, S. Bennacef, J.L. Gauvain, H. Dartigues, and J. Temem, “User Evaluation of the Mask Kiosk,” Proc. ICSLP, 2875–2878, 1998.
[50] L. Lamel, S. Rosset, J.L. Gauvain, S. Bennacef, M. Garnier-Rizet, and B. Prouts, “The LIMSI ARISE System,” Proc. IVTTA, 209–214, 1998.
[51] R. Lau, G. Flammia, C. Pao, and V. Zue, “WebGalaxy – Integrating Spoken Language and Hypertext Navigation,” Proc. Eurospeech, 883–886, 1997.
[52] E. Levin, R. Pieraccini, and W. Eckert, “Using Markov Decision Process for Learning Dialogue Strategies,” Proc. ICASSP, 201–204, 1998.
[53] “Speech Perception by Humans and Machines,” Speech Communication, 22(1), 1–15, 1997.
[54] K. Livescu, “Analysis and Modelling of Non-Native Speech for Automatic Speech Recognition,” S.M. Thesis, MIT, 1999.
[55] S. Marcus et al., “Prompt Constrained Natural Language – Evolving the Next Generation of Telephony Services,” Proc. ICSLP, 857–860, 1996.
[56] D. Massaro, Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, MIT Press.
[57] D. McDonald and L. Bolc (Eds.), Natural Language Generation Systems (Symbolic Computation Artificial Intelligence), Springer Verlag, Berlin, 1998.
[58] K. McKeown, S. Pan, J. Shaw, D. Jordan, and B. Allen, “Language Generation for Multimedia Healthcare Briefings,” Proc. Applied Natural Language Proc., 1997.
[59] H. Meng, S. Busayapongchai, J. Glass, D. Goddeau, L. Hetherington, E. Hurley, C. Pao, J. Polifroni, S. Seneff, and V. Zue, “A Conversational System in the Automobile Classifieds Domain,” Proc. ICSLP, 542–545, 1996.
[60] S. Miller, R. Schwartz, R. Bobrow, and R. Ingria, “Statistical Language Processing Using Hidden Understanding Models,” Proc. ARPA Speech and Natural Language Workshop, 278–282, 1994.
[61] R. Moore, D. Appelt, J. Dowding, J. Gawron, and D. Moran, “Combining Linguistic and Statistical Knowledge Sources in Natural-Language Processing for ATIS,” Proc. ARPA Spoken Language Systems Workshop, 261–264, 1995.
[62] H. Noguchi and Y. Den, “Prosody-Based Detection of the Context of Backchannel Responses,” Proc. ICSLP, 487–490, 1998.
[63] E. Nöth, “On the Use of Prosody in Automatic Dialogue Understanding,” Proc. ESCA Workshop on Dialogue and Prosody, 25–34, 1999.
[64] Nuance Communications, http://www.nuance.com
[65] A. Oh, “Stochastic Natural Language Generation for Spoken Dialog Systems,” M.S. Thesis, CMU, May 2000.
[66] M. Ostendorf, C. Wightman, and N. Veilleux, “Parse Scoring with Prosodic Information: An Analysis/Synthesis Approach,” Computer Speech & Language, 7(3), 193–210, 1993.
[67] S. Oviatt, “Multimodal Interfaces for Dynamic Interactive Maps,” Proc. Conference on Human Factors in Computing Systems: CHI '96, 95–102, 1996.
[68] D. Pallett, J. Fiscus, W. Fisher, J. Garafolo, B. Lund, A. Martin, and M. Pryzbocki, “Benchmark Tests for the ARPA Spoken Language Program,” Proc. ARPA Spoken Language Systems Technology Workshop, 5–36, 1995.
[69] C. Pao, P. Schmid, and J. Glass, “Confidence Scoring for Speech Understanding Systems,” Proc. ICSLP, 815–818, 1998.
[70] K. Papineni, S. Roukos, and R. Ward, “Maximum Likelihood and Discriminative Training of Direct Translation Models,” Proc. ICASSP, 189–192, 1998.
[71] J. Peckham, “A New Generation of Spoken Dialogue Systems: Results and Lessons from the SUNDIAL Project,” Proc. Eurospeech, 33–40, 1993.
[72] R. Pieraccini and E. Levin, “Stochastic Representation of Semantic Structure for Speech Understanding,” Speech Communication, 11, 283–288, 1992.
[73] R. Pieraccini, E. Levin, and W. Eckert, “AMICA: The AT&T Mixed Initiative Conversational Architecture,” Proc. Eurospeech, 1875–1879, 1997.
[74] J. Polifroni, S. Seneff, J. Glass, and T. Hazen, “Evaluation Methodology for a Telephone-based Conversational System,” Proc. Int. Conf. on Lang. Resources and Evaluation, 42–50, 1998.
[75] J. Polifroni and S. Seneff, “Galaxy-II as an Architecture for Spoken Dialogue Evaluation,” to appear, Proc. Int. Conf. on Lang. Resources and Evaluation, 2000.
[76] P. Price, “Evaluation of Spoken Language Systems: the ATIS Domain,” Proc. DARPA Speech and Natural Language Workshop, 91–95, 1990.
[77] E. Reiter and R. Dale, Building Natural Language Generation Systems, Cambridge University Press, Cambridge, 2000.
[78] S. Rosset, S. Bennacef, and L. Lamel, “Design Strategies for Spoken Language Dialog Systems,” Proc. Eurospeech, 1535–1538, 1999.
[79] A. Rudnicky, M. Sakamoto, and J. Polifroni, “Evaluating Spoken Language Interaction,” Proc. DARPA Speech and Natural Language Workshop, 150–159, October 1989.
[80] A. Rudnicky, J.M. Lunati, and A. Franz, “Spoken Language Recognition in an Office Management Domain,” Proc. ICASSP, 829–832, 1991.
[81] A. Rudnicky, E. Thayer, P. Constantinides, C. Tchou, R. Shern, K. Lenzo, W. Xu, and A. Oh, “Creating Natural Dialogs in the Carnegie Mellon Communicator System,” Proc. Eurospeech, 1531–1534, 1999.
[82] D. Sadek, “Design Considerations on Dialogue Systems: From Theory to Technology – The Case of Artimis,” Proc. ESCA Workshop on Interactive Dialogue in Multi-Modal Systems, 173–188, Kloster Irsee, 1999.
[83] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, “ATR ν-Talk Speech Synthesis System,” Proc. ICSLP, 483–486, 1992.
[84] A. Sanderman, J. Sturm, E. den Os, L. Boves, and A. Cremers, “Evaluation of the Dutch Train Timetable Information System developed in the ARISE Project,” Proc. IVTTA, 91–96, 1998.
[85] J. van Santen, L. Pols, M. Abe, D. Kahn, E. Keller, and J. Vonwiller, “Report on the 3rd ESCA TTS Workshop Evaluation Procedure,” Proc. 3rd ESCA Workshop on Speech Synthesis, 329–332, Jenolan Caves, Australia, 1998.
[86] E. Shriberg et al., “Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?,” Language and Speech, 41, 439–447, 1998.
[87] S. Seneff, “TINA: A natural language system for spoken language applications,” Computational Linguistics, 18(1), 61–86, 1992.
[88] S. Seneff, “Robust Parsing for Spoken Language Systems,” Proc. ICASSP, 189–192, 1992.
[89] S. Seneff, V. Zue, J. Polifroni, C. Pao, L. Hetherington, D. Goddeau, and J. Glass, “The Preliminary Development of a Displayless Pegasus System,” Proc. ARPA Spoken Language Technology Workshop, 212–217, 1995.
[90] S. Seneff, D. Goddeau, C. Pao, and J. Polifroni, “Multimodal discourse modelling in a multi-user multi-domain environment,” Proc. ICSLP, 188–191, 1996.
[91] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and V. Zue, “Galaxy-II: A Reference Architecture for Conversational System Development,” Proc. ICSLP, 931–934, 1998.
[92] S. Seneff, R. Lau, and J. Polifroni, “Organization, Communication, and Control in the Galaxy-II Conversational System,” Proc. Eurospeech, 1271–1274, 1999.
[93] S. Seneff, R. Lau, J. Glass, and J. Polifroni, “The Mercury System for Flight Browsing and Pricing,” MIT Spoken Language System Group Annual Progress Report, 23–28, 1999.
[94] S. Seneff and J. Polifroni, “A New Restaurant Guide Conversational System: Issues in Rapid Prototyping for Specialized Domain,” Proc. ICSLP, 665–668, 1996.
[95] SIGdial URL: http://www.sigdial.org/
[96] R. Sproat, A. Hunt, M. Ostendorf, P. Taylor, A. Black, K. Lenzo, and M. Edgington, “SABLE: A Standard for TTS Markup,” Proc. ICSLP, 1719–1722, 1998.
[97] D. Stallard and R. Bobrow, “Fragment Processing in the DELPHI System,” Proc. DARPA Speech and Natural Language Workshop, 305–310, 1992.
[98] V. Souvignier, A. Kellner, B. Rueber, H. Schramm, and F. Seide, “The Thoughtful Elephant: Strategies for Spoken Dialogue Systems,” IEEE Trans. SAP, 8(1), 51–62, 2000.
[99] J. Sturm, E. den Os, and L. Boves, “Dialogue Management in the Dutch ARISE Train Timetable Information System,” Proc. Eurospeech, 1419–1422, 1999.
[100] S. Sutton, E. Kaiser, A. Cronk, and R. Cole, “Bringing Spoken Language Systems to the Classroom,” Proc. Eurospeech, 709–712, 1997.
[101] S. Sutton et al., “Universal Speech Tools: The CSLU Toolkit,” Proc. ICSLP, 3221–3224, 1998.
[102] D. Thomson and J. Wisowaty, “User Confusion in Natural Language Services,” Proc. ESCA Workshop on Interactive Dialogue in Multi-Modal Systems, 189–196, Kloster Irsee, 1999.
[103] VoiceXML URL: http://www.voicexml.org/
[104] M. Walker, D. Litman, C. Kamm, and A. Abella, “PARADISE: A General Framework for Evaluating Spoken Dialogue Agents,” Proc. ACL/EACL, 271–280, 1997.
[105] M. Walker, J. Boland, and C. Kamm, “The Utility of Elapsed Time as a Usability Metric for Spoken Dialogue Systems,” Proc. ASRU Workshop, 1167–1170, Keystone, CO, 1999.
[106] W. Wang, J. Glass, H. Meng, J. Polifroni, S. Seneff, and V. Zue, “Yinhe: A Mandarin Chinese Version of the Galaxy System,” Proc. Eurospeech, 351–354, 1997.
[107] N. Ward, “Using Prosodic Cues to Decide when to Produce Back-Channel Utterances,” Proc. ICSLP, 1728–1731, 1996.
[108] W. Ward, “Modelling Non-Verbal Sounds for Speech Recognition,” Proc. DARPA Workshop on Speech and Natural Language, 47–50, 1989.
[109] W. Ward, “The CMU Air Travel Information Service: Understanding Spontaneous Speech,” Proc. ARPA Workshop on Speech and Natural Language, 127–129, 1990.
[110] W. Ward, “Integrating Semantic Constraints into the SPHINX-II Recognition Search,” Proc. ICASSP, II-17-20, 1994.
[111] W. Ward and S. Issar, “Recent Improvements in the CMU Spoken Language Understanding System,” Proc. ARPA Human Language Technology Workshop, 213–216, 1996.
[112] J. Yi and J. Glass, “Natural-Sounding Speech Synthesis Using Variable Length Units,” Proc. ICSLP, 1167–1170, 1998.
[113] V. Zue, S. Seneff, J. Polifroni, M. Phillips, C. Pao, D. Goddeau, J. Glass, and E. Brill, “Pegasus: A Spoken Language Interface for On-Line Air Travel Planning,” Speech Communication, 15, 331–340, 1994.
[114] V. Zue, S. Seneff, J. Glass, J. Polifroni, C. Pao, T. Hazen, and L. Hetherington, “Jupiter: A Telephone-Based Conversational Interface for Weather Information,” IEEE Trans. SAP, 8(1), 85–96, 2000.

Victor W. Zue received the Doctor of Science degree in Electrical Engineering from the Massachusetts Institute of Technology in 1976. He is currently a Senior Research Scientist at MIT, Associate Director of the Institute's Laboratory for Computer Science, and head of its Spoken Language Systems Group. Dr. Zue's main research interest is in the development of conversational interfaces to facilitate graceful human/computer interactions. He has taught many courses at MIT and around the world, and he has also written extensively on this subject. He is a Fellow of the Acoustical Society of America. In 1994, he was elected Distinguished Lecturer by the IEEE Signal Processing Society.

James R. Glass received the B.Eng. degree from Carleton University, Ottawa, Canada, in 1982, and the M.S. and Ph.D. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology, Cambridge, in 1984 and 1988, respectively. Since then he has worked in MIT's Laboratory for Computer Science, where he is currently a Principal Research Scientist and Associate Head of the Spoken Language Systems Group. His research interests include acoustic-phonetic modeling, speech recognition and understanding, and corpus-based speech synthesis. He served as a member of the IEEE Acoustics, Speech, and Signal Processing Speech Technical Committee from 1992–1995. From 1997–1999 he served as an associate editor for the IEEE Transactions on Speech and Audio Processing.