

SPOKEN LANGUAGE UNDERSTANDING: A SURVEY

Renato De Mori
LIA – BP 1228 – 84911 Avignon CEDEX 9 (France)
renato.demori@univ-avignon.fr

ABSTRACT

A survey of research on spoken language understanding is presented. It covers aspects of knowledge representation, automatic interpretation strategies, semantic grammars, conceptual language models, semantic event detection, shallow semantic parsing, semantic classification, semantic confidence, and active learning.

Index Terms— Spoken language understanding, conceptual language models, spoken conceptual
constituent detection, stochastic semantic grammars, semantic confidence measures, active learning.

1. INTRODUCTION

Epistemology, the science of knowledge, considers a datum as its basic unit. A datum can be an object, an action or an event in the world, and can have time and space coordinates, multiple aspects and qualities that make it different from others. A datum can be represented by an image, or it can be abstract and be represented by a concept.

Computer epistemology deals with observable facts and their representation in a computer.
Knowledge about the structure of a domain represents a datum by an object and groups objects into
classes by their properties.

Semantics deals with the organization of meanings and the relations between signs or symbols and what they denote or mean [135]. Computer semantics performs a conceptualization of the world using well-defined elements of programming languages.

Spoken Language Understanding (SLU) is the interpretation of signs conveyed by a speech signal. This is a difficult task because signs for meaning are mixed with other information such as speaker identity and environment. Signs to be used for interpretation have to be defined and extracted from the signal. Meaning is represented by a computer language. Relations between signs and meaning are part of the interpretation knowledge source (KS) and are applied by one or more processes controlled by strategies.

The knowledge used is often imperfect. The transcription of user utterances in terms of word hypotheses
is performed by an Automatic Speech Recognition (ASR) system which makes errors. Strategies of
some SLU systems perform transformations from signals to words, then from words to meaning. Some
strategies are also proposed to transform signals into basic semantic constituents to be further composed
into semantic structures.
Programming languages have their own syntax and semantics. The former defines legal programming statements, the latter specifies the operations a machine performs when an instruction is executed.

Specifications are defined in terms of the procedures the machine has to carry out. Semantic analysis of
a computer program is essential for understanding the behavior of a program and its coherence with the
design concepts and goals. Formal logics can be used to describe computer semantics.

Computer programs conceived for interpreting natural language differ from the human process they
model. They can be considered as approximate models for developing useful applications, interesting
research experiments and demonstrations. Semantic representations in computers usually treat data as
objects respecting logical adequacy in order to formally represent any particular interpretation of a
sentence. Even if utterances, in general, convey meanings whose relations may not be expressible in formal logics ([45], p. 287), formal logics have been considered adequate for representing natural language semantics in many application domains. Logics used for representing natural language semantics should be able to deal with intension (the essence of a concept) and extension (the set of all objects which are instances of a given concept).

Computer systems interpret natural language in order to perform actions such as database access and display of the results, and may require the use of knowledge which is not coded into the sentence but can be inferred from system knowledge stored in long or short term memories. It is argued in [135] that a specification for natural language semantics requires more than the transformation of a sentence into a representation. In fact, computer representations should permit, among other things, legitimate conclusions to be drawn from data [72].

Interpretation may require the execution of procedures that specify the truth conditions of declarative
statements as well as the intended meaning of questions and commands [135].

Problems and challenges in SLU can be grouped as follows: meaning representation; definition and extraction of signs; conception of interpretation KSs based on relations between signs and meaning and between instances of meaning; processes for sign extraction; generation of hypotheses about units of meaning, also called semantic constituents; and composition of constituents into semantic structures. As processes generate interpretation hypotheses, other challenging problems are the evaluation of confidence for semantic hypotheses, the design of interpretation KSs using human knowledge, automatic learning from annotated corpora, and the collection and semantic annotation of corpora.

This report reviews the history of SLU research with particular attention to the evolution of
interpretation paradigms, influenced by experimental results obtained with evaluation corpora. This
review integrates and complements reviews in [22,73].

2. COMPUTER REPRESENTATIONS OF MEANING

Computer representation of meaning is described by a Meaning Representation Language (MRL), which has its own syntax and semantics. An MRL should follow a representation model coherent with a theory of epistemology, taking into account intension and extension, relations, reasoning, the composition of semantic constituents into structures, and procedures for relating them with signs. The semantic knowledge of an application is a knowledge base (KB). A convenient way of reasoning about semantic knowledge is to represent it as a set of logic formulas. Formulas contain variables which are bound by constants and may be typed. An object is built by binding all the variables of a formula or by composing existing objects.
Semantic compositions and decisions about composition actions are the result of an inference process.
The basic inference problem is to determine whether KB ⊨ F, i.e. whether KB entails a formula F, meaning that F is true in all possible variable assignments (worlds) for which KB is true.
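
As a minimal illustration (a sketch with illustrative propositional variables, not from the survey), KB ⊨ F can be checked in the finite Boolean case by enumerating all worlds:

from itertools import product

def entails(kb, f, variables):
    # Check KB |= F by enumerating all truth assignments (worlds):
    # F must be true in every world in which KB is true.
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if kb(world) and not f(world):
            return False
    return True

# Illustrative: "rain" together with "rain implies wet_road" entails "wet_road".
kb = lambda w: w["rain"] and (not w["rain"] or w["wet_road"])
f = lambda w: w["wet_road"]
print(entails(kb, f, ["rain", "wet_road"]))  # True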

In [135], the possibility of representing semantic relations with links between classes and objects is
discussed. The formulas in a KB describe concepts and their relations which can be represented in a
network called semantic network. A semantic network is made of nodes corresponding to entities and
links corresponding to relations. This model combines the ability to store factual knowledge and to
model associative connections between entities [135]. Examples of relations are composition
functions [44].

The structure of semantic networks can be defined by a graph grammar. Computer programming
classes and objects called frames can be defined to represent entities and relations in semantic
networks.

In a frame-based MRL, a grammar of frames is a model for representing semantic entities and their properties. The language should be able to represent types of conceptual structures as well as instances of them.

Such a grammar should generate frames describing general concepts and their specific instances. Part
of a frame is a data structure which represents a concept by associating to the concept name a set of
roles which are represented by slots. Finding values for roles corresponds to fill the frame slots. In
general, slots contain associations between names of aspects of an entity and its descriptions, also
called slot fillers which are constrained to respect certain given types. A slot filler can be the instance
of another frame. This is represented by a pointer from the filler to the other frame.

Acceptable frames for semantic representations in a domain can be characterized by frame grammars
which generate acceptable structures and have rules of the type:

<frame> : <frame-name> <slots>*
<slot> : <aspect-name> [<description> "of potential slot fillers" : types]
<description> : <description-indicator> <bridge-name> <frame-name>

A frame system is a network of frames.

Important examples of MRL semantics are discussed in [135]. Early frame representations were used to
represent facts about an object with a property list. For example, a specific address can be represented by
the following frame:

{a0001
instance_of address
loc Avignon
area Vaucluse
country France
street 1, avenue Pascal
zip 84000
}

Here a0001 is a handle that represents an instance of a class which is specified by the value of the first
slot. The other slots, made of a property name and a value, define the property list of this particular
instance of the class “address”.

The above frame can be derived [145], after skolemization, from the following logic formula:

(∃x) [instance_of(x, address) ∧ loc(x, Avignon) ∧ area(x, Vaucluse) ∧ country(x, France) ∧ street(x, 1 avenue Pascal) ∧ zip(x, 84000)]

A definition with a similar syntax, but with a different semantics, is provided for the address class, which defines the structure of any address:

{address
loc TOWN
……attached procedures
area DEPARTMENT OR PROVINCE OR STATE
……attached procedures
country NATION
……attached procedures
street NUMBER AND NAME
……attached procedures
zip ORDINAL NUMBER
}

This frame has a different semantics, since it defines a prototype from which many instances can be obtained.
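
The two kinds of frames can be rendered directly in code; a sketch (the class and instance follow the address example above; the unfilled_slots helper is illustrative):

# Class frame: slot names with type constraints (the "address" prototype).
address_class = {
    "loc": "TOWN",
    "area": "DEPARTMENT OR PROVINCE OR STATE",
    "country": "NATION",
    "street": "NUMBER AND NAME",
    "zip": "ORDINAL NUMBER",
}

# Instance frame: the property list of a0001, binding slots to values.
a0001 = {
    "instance_of": "address",
    "loc": "Avignon",
    "area": "Vaucluse",
    "country": "France",
    "street": "1, avenue Pascal",
    "zip": "84000",
}

def unfilled_slots(instance, frame_class):
    # Slots defined by the class frame but not yet bound in the instance.
    return [slot for slot in frame_class if slot not in instance]

print(unfilled_slots(a0001, address_class))  # [] : all slots are filled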

The semantics of an MRL can be described by procedures for generating instances of entities and relations. This characterizes procedural semantics. Procedures for slot filling, as well as for frame evocation, use methods.

Different frames may share slots with similarity links. There may be necessary and optional slots. Fillers can be obtained by attachment of procedures or detectors (e.g. of noun groups), by inheritance, or by default.

Procedures can also be attached to slots with the condition in which they have to be executed. Examples
of conditions are when-needed, when-filled. Slots may contain expectations or replacements (to be
considered if slots cannot be filled).
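
Such attachment conditions can be sketched as callbacks; when_filled, when_needed and the helper lookup_zip below are illustrative names, not from the survey:

class Slot:
    # A frame slot with optional when-filled / when-needed procedures.
    def __init__(self, when_filled=None, when_needed=None, default=None):
        self.value = None
        self.when_filled = when_filled    # runs after a filler is bound
        self.when_needed = when_needed    # computes a filler on demand
        self.default = default            # replacement if the slot cannot be filled

    def fill(self, value):
        self.value = value
        if self.when_filled:
            self.when_filled(value)

    def get(self):
        if self.value is None and self.when_needed:
            self.value = self.when_needed()
        return self.value if self.value is not None else self.default

def lookup_zip(town):
    # Illustrative when-needed procedure: derive a zip code from the town.
    return {"Avignon": "84000"}.get(town)

zip_slot = Slot(when_needed=lambda: lookup_zip("Avignon"))
print(zip_slot.get())  # 84000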
Descriptions are attached to slots to specify constraints. Given a slot filler for a slot, the attached description can be inferred. Descriptions can be instantiations of a concept carrier and can inherit its properties. Descriptions may involve connectives, coreference (descriptions attached to one slot are attached to another and vice versa), and declarative conditions.

Verbs are fundamental components of natural language sentences. They represent actions for which
different entities play different roles. Actions reveal how sentence phrases and clauses are semantically
related to verbs by expressing cases for verbs. A case is the name of a particular role that a noun phrase
or other component takes in the state or activity expressed by the verb in a sentence. There is a case
structure for each main verb. Attempts were made for mapping specific surface cases into a deep
semantic representation expressing a sort of semantic invariant. Many deep semantic representations are
based on deep case n-ary relations between concepts as proposed by Fillmore [31]. Deep case systems
have very few cases each one representing a basic semantic constraint.

Case determination may depend on syntactic information (case signals) as well as feature checking (case
conditions) and can be done by a case function. This function may return the likelihood that a given
preposition term serves the case relationship to the main verb of the sentence. It is possible to use a
variable number of preemptive levels. A case function may return a value which preempts any previous
use of that case.

The cases proposed in [31] are:

Agentive (A) - the animate instigator of an action,
Instrumental (I) - inanimate force or object causally involved,
Dative (D) - animate being affected,
Factitive (F) - object or being resulting from an action,
Locative (L) - location or spatial orientation,
Objective (O) - determined by the verb.

The verb determines a predicate P which has associated cases, as in "push [O {A} {I}]". The predicate means that push must have an object O; as {} means optional, push may have an agent A and a force I. The case structure of P is a set of sequences of cases. Cases which are properties of a verb are inner cases (in particular the obligatory ones). Verbs typically specify up to three inner cases, at least one of which must always be realized in any sentence using the verb. Sometimes a particular case must always be present. Verb predicates have arguments characterized by semantic roles which can be cases. Predicates and arguments of this type are related by linguistic structures.
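
As an illustration (a sketch, not from the survey), the acceptability of a set of detected cases against the structure push [O {A} {I}] can be checked mechanically:

# Case structure for "push", mirroring the notation push [O {A} {I}]:
# the object O is obligatory, agent A and instrument I are optional.
CASE_STRUCTURES = {
    "push": {"obligatory": {"O"}, "optional": {"A", "I"}},
}

def cases_acceptable(verb, found_cases):
    # All obligatory cases must be present and no alien case may occur.
    spec = CASE_STRUCTURES[verb]
    found = set(found_cases)
    return spec["obligatory"] <= found and found <= spec["obligatory"] | spec["optional"]

print(cases_acceptable("push", ["O", "A"]))  # True
print(cases_acceptable("push", ["A"]))       # False: the object O is missing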

Case structures are Relation Semantic Structures (RSS). Values for roles can be complex objects, called
Object Semantic Structures (OSS), characterized by properties (Waltz 1981).

A verb with its cases and other roles can be represented by a frame as in the following example:

{accept
is_a : verb
subject [human…..]
theme [………….]
…………………
Other roles…… [………….]
}

The brackets contain constraints on the types of values obtained with the slot-filling procedures. Slots with constraints about possible fillers have assertional import, while slots filled by objects have structural import. Other representations can be added, such as mass terms, adverbial modification, probabilistic information, degree of certainty, time and tense.

An instance of a verb frame could be:

{V003
instance_of : accept
subject user
theme [service_004]
…………………
Other roles…… [………….]
}

which is a representation of the predicate accept(user, service_004).

Humans communicate with computers through discourse actions that include speech acts. An example of a speech act is a request, as in the following question:

“What is the zip code of Avignon?”.

Using the notation in (Allen, 1987), the meaning of this question can be represented by the following logical sentence:

REQUEST(user, system, INFORMREF(y, loc(G1, Avignon) ∧ zip(G1, y)))

A speech act REQUEST is a function having as arguments the user, who is the agent of the request; the system, which is the destination of the request; and the theme, which is the result of the function INFORMREF. The function INFORMREF returns the value of y for which loc(G1, Avignon) ∧ zip(G1, y) is true. G1 is a constant obtained by skolemization of the existential quantifier in the logical description of frame a0001. REQUEST returns the value of INFORMREF.

In order to evaluate INFORMREF(y, loc(G1, Avignon) ∧ zip(G1, y)), a database is consulted. In a relational database, a conceptual entity corresponds to a table, and an instance corresponds to a row. The database content is logically represented by a collection of instances of the frame address.
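
In this relational setting, evaluating INFORMREF amounts to a selection over the address table; a minimal sketch with Python's sqlite3 (table and column names follow the frame above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (loc TEXT, area TEXT, country TEXT, street TEXT, zip TEXT)")
conn.execute("INSERT INTO address VALUES ('Avignon', 'Vaucluse', 'France', '1, avenue Pascal', '84000')")

# INFORMREF(y, loc(G1, Avignon) ∧ zip(G1, y)): find the y such that some
# address instance G1 is located in Avignon and has zip code y.
row = conn.execute("SELECT zip FROM address WHERE loc = ?", ("Avignon",)).fetchone()
print(row[0])  # 84000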

Interesting books [2,4,44,146,148] describe various types of semantic knowledge and their use. A common aspect of many of them is that it is possible to represent complex relational structures with non-probabilistic schemes that are more effective than context-free grammars. For example, in KL-ONE [147] concept descriptions account for the internal structure with role/filler descriptions and for a global structural description (SD). Roles have substructures with constraints specifying types and quantities of fillers. SDs are logical expressions indicating how role fillers interact. Role descriptions contain value restrictions. Epistemological relations are defined for composing conceptual structures. They may connect formal objects of the same type and account for inheritance. It is important to point out that semantic knowledge is, in general, context-sensitive.

Semantic relations are used to compose instances of conceptual structures. Functions are examples of relations with their arguments. An example of a function with one argument, using the notation in [44], is:

[Place IN ([Thing LOC])]

The subscripts (Place, Thing) are ontological category variables. IN indicates a function whose argument follows between parentheses.

Selectional restrictions are general semantic restrictions on arguments. In the above example, LOC is a
restriction for Thing. Restrictions use senses which are part of type hierarchies or ontologies.
Disjunctions are represented within curly brackets.

An interpretation is an instance of a semantic structure in which a restriction is bound to a type/token pair, such as (City Paris) for LOC.

General basic building blocks for conceptual structures are lexical items with associated constraints of
various types and patterns of semantic structures.
In [81], schemas containing roles and other information are proposed as active structures to model
events and capture sequentiality.

A popular example of an MRL is the Web Ontology Language (OWL) [87], which integrates some of the most important requirements for computer semantic representation.

A heterarchical architecture based on a KB made of situation-action (production) rules is described in [29].

3. SYNTACTIC AND SEMANTIC ANALYSIS FOR INTERPRETATION

A generic architecture structure for performing the SLU process is shown in Figure 1. KS indicates
knowledge sources which are stored in a long term memory with the acoustic models (AM) and
language models (LM), while hypotheses are written into a short term memory (STM). The content of
the STM can be used for adapting some KSs.

An initial, considerable effort in SLU research was made with an ARPA project started in 1971. The
project is reviewed in [55] and included approaches mostly based on Artificial Intelligence (AI) for
combining syntactic analysis and semantic representation in logic form. Some early SLU systems have
the architecture shown in Figure 2. A sequence of word hypotheses is generated by an ASR system. Interpretation is performed with the same approaches used for written text. In Figure 2, S indicates speech, W indicates written text, and Control indicates control strategies.
Figure 1 – Generic architecture structure for performing the SLU process: a long-term memory stores the AM, LM and interpretation KSs used by a speech-to-conceptual-structures/MRL module; signs, words, concept tags, concept structures and MRL descriptions are written into a short-term memory used by the dialogue manager and for learning.

Figure 2 – System architecture of early SLU systems: an ASR module, with its control and KSs, transcribes speech S into text; a written-language understanding (WLU) module, with its control and KSs for written text W, produces the meaning.

It is not clear how concepts relate to words. The knowledge that relates these two levels has to contain
patterns of word sequences for each conceptual constituent. Patterns used for detecting different
constituents in the same sentence may share components and may capture context dependences, but are
of finite length because sentences, especially spoken sentences, contain a finite and often small number of
words. Finite state models are thus appropriate for concisely representing these patterns.

On the contrary, semantic relations may use components hypothesized in different sentences and generate
structures which may belong to a context sensitive language. Sequences of conceptual constituents may
have to satisfy constraints which are different from the constraints imposed on words expressing a
conceptual constituent. Furthermore, semantic relations are language independent, while relations
between conceptual constituents and words are language dependent.

It was assumed, as stated for example in [128], that a semantic analyzer has to work with a syntactic
analyzer and produce data acceptable to a logical deductive system. This is motivated by arguments, for
example in [44], that each major syntactic constituent of a sentence maps into a conceptual constituent,
but the inverse is not true. For example, adapting the notation in [44], a sentence requiring a restaurant
near the Montparnasse metro station in Paris can be represented by the following bracketed conceptual
structure expression:

Γ: [Action REQUEST ([Thing RESTAURANT], [Path NEAR ([Place IN ([Thing MONTPARNASSE])])])]

The formalism is based on a set of categories. Each category, e.g. Place, can be elaborated as a place function, e.g. IN, and an argument.

The expression Γ can be obtained from a syntactic structure like this:

Ψ:[S[VP [V give, PR me] NP [ART a, N restaurant] PP[PREP near, NP [N Montparnasse, N station]]]]

Concerning the relation between syntax and semantics, in [44], it is observed that:

• Each major syntactic constituent of a sentence maps into a conceptual constituent, but the inverse is not true.

• Each conceptual constituent supports the encoding of units (linguistic, visual, …).

• Many of the categories support the type/token distinction (e.g., place_type, place_token).

• Many of the categories support quantification.

• Some realizations of conceptual categories in conceptual structures can be decomposed into a function/argument structure.

• Various types of relations, such as IS_A and PART_OF, hold between conceptual constituents. These relations can be used to infer the presence of a constituent in a sentence given the presence of other constituents.
Assuming that natural languages are susceptible to the same kind of semantic analysis as programming
languages, in [78], it is suggested that each syntactic rule of a natural language generative grammar is
associated with a semantic building procedure that turns the sentence into a logic formula.

An association of semantic building formulas with syntactic analysis is proposed in categorial grammars conceived for obtaining a surface semantic representation [62].

The syntax of a language is seen as an algebra, and grammatical categories are seen as functions. Lexical representations are associated with a syntactic pattern that suggests possible continuations of the syntactic analysis and the semantic expression to be generated, as shown in the following fragment of the lexicon:

write (S\NP)/NP λx λy ((WRITE x) y)
Mary S/(S\NP) λf (f Mary)
a NP/N λx (a x)
letter N

Elements are associated with a syntactic category which identifies them as functions and specifies the type and directionality of their arguments and the type of their results. So, in the example "Mary writes a letter", the lexical entry <writes (S\NP)/NP> causes the fact that when "writes" in the data is matched with the lexical entry for "write", the associated function (S\NP)/NP is applied. The symbol / indicates a forward function application that looks for a match with an NP following "writes" and requires the evaluation of the function (S\NP). The word "a" has lexical entry <a NP/N>. This causes the execution of another forward function application that looks for a noun following "a". As the noun is found (<letter N>), the semantic function λx (a x) is executed, returning (a letter), which is associated with the assertion of NP that now matches the expectation of (S\NP)/NP. The x of λx λy ((WRITE x) y) is bound to (a letter), leading to λy ((WRITE a letter) y). Now the backward function S\NP has to be executed. The symbol \ means that the function will look backward for a match with a lexical entry with label NP, which is found by performing the forward execution of the function associated with the lexical entry <Mary S/(S\NP)>. The function asserts S if what follows is asserted. This is true because it is the backward expectation of the verb and NP is a rewriting for Mary. As a result of matching, y is bound to Mary, producing the semantic representation ((WRITE a letter) Mary) and causing the assertion of the start symbol S, with which the analysis of the sentence to be interpreted is successfully completed.
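
The semantic half of this derivation can be reproduced with ordinary higher-order functions; a sketch (semantic values only, lexicon as in the example):

# Semantic side of the categorial derivation of "Mary writes a letter":
# each lexical entry is a lambda term and parsing reduces to function application.
write = lambda x: lambda y: ("WRITE", x, y)   # category (S\NP)/NP
mary = lambda f: f("Mary")                    # category S/(S\NP), type-raised subject
a = lambda n: ("a", n)                        # category NP/N
letter = "letter"                             # category N

np = a(letter)    # forward application: (a letter)
vp = write(np)    # forward application: λy ((WRITE (a letter)) y)
s = mary(vp)      # the subject combines with the verb phrase
print(s)          # ('WRITE', ('a', 'letter'), 'Mary')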

Parsing a sentence results in asserting logical sentences from which frames can be instantiated and slots filled by suitable procedures. Semantic knowledge is associated in this case with lexical entries, and logic formulas are composed by actions performed during parsing. Composition knowledge is associated with grammar rules and is seen as a grammar augmentation. The use of a lexicon with Montague grammars is discussed in detail in [26].
The organization of lexical knowledge for sentence interpretation has recently been the object of investigation. VerbNet [54] is a manually developed hierarchical verb lexicon. For each verb class, VerbNet specifies the syntactic frames along with the semantic role assigned to each slot of a frame. Modelling joint information about the argument structure of a verb is proposed in [123]. In the WordNet project [75], a word is represented by a set of synonymous senses belonging to an alphabet of synsets. It can be used for word sense disambiguation.

Suitable procedures can be attached to frames to generate logical sentences when slots are filled. Details on the use of syntax and semantics for natural language understanding can be found in [2].
Slot-filling procedures can be executed under the control of a parser or, in general, by precondition-action rules. As natural language is context-sensitive, procedural networks for parsing under the control of Augmented Transition Network Grammars (ATNG) were proposed. ATNGs [134] are augmentations of Transition Network Grammars (TNGs). TNGs are made of states and arcs. The input string is analyzed during parsing from left to right, one word at a time. The input word and the active state determine the arc followed by the parser. Arcs have types, namely CAT (to read an input symbol), PUSH (to transfer control to a sub-network) and POP (to transfer control from a sub-network back to the network that executed the PUSH to it).
In ATNGs, condition testing and register-setting actions are associated with certain arcs. Actions set the content of registers with linguistic feature values and can also be used for building parse trees. It is also possible to introduce actions of the type BUILD, associated with an arc, to compose a parse tree or to generate semantic interpretations.

An example of ATNG is shown in Figure 3.

Figure 3 – Example of ATNG: an NP network with DET and N arcs, a JMP arc, a SETR NP (current word) register-setting action, and a POP arc.

Different ATNGs can be used in cascade for parsing and interpretation. An arc type TRANSMIT
transfers syntactic structures from the syntactic to the semantic ATNG. Augmentations for generating
semantic hypotheses are shown in the following example:
(atn np
(CAT determiner)
(optional*
(CAT adjective))
(CAT noun)
(BUILD semantics ……..))
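
A minimal rendering of a TNG/ATN-style noun-phrase network with a register-setting action (a sketch; the lexicon, state names and NP register are illustrative):

# Tiny transition network for NP -> DET (ADJ)* N, with a SETR-like
# register action on the final arc.
LEXICON = {"a": "DET", "the": "DET", "red": "ADJ", "letter": "N"}

def parse_np(words):
    registers = {}
    state = "S0"
    for i, word in enumerate(words):
        cat = LEXICON.get(word)
        if state == "S0" and cat == "DET":
            state = "S1"                      # CAT arc: consume the determiner
        elif state == "S1" and cat == "ADJ":
            state = "S1"                      # loop on adjectives
        elif state == "S1" and cat == "N":
            registers["NP"] = words[: i + 1]  # SETR NP (current phrase)
            return registers                  # POP: accept and return registers
        else:
            return None
    return None

print(parse_np(["the", "red", "letter"]))  # {'NP': ['the', 'red', 'letter']}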

If a portion of a parse tree can be mapped into a semantic symbol of an MRL, then this symbol can be used as a nonterminal in a grammar which integrates syntactic and semantic knowledge. In [135], syntactic, semantic and pragmatic knowledge are integrated into procedural semantic grammar networks in which symbols for sub-networks can correspond to syntactic or semantic entities. An example of a portion of a parse tree containing semantic nonterminal symbols is shown in the following grammar fragment:

TOLOC -> to CITY | ……
CITY -> London | ……

In [139], TNGs are proposed as procedural attachments to frame slots. A chart parser can be activated for each TNG under the control of the interpretation strategy. In [133], a search algorithm was implemented in which the TNG was employed during ASR decoding.

There are several ways of using syntactic and semantic analysis. In most systems, a semantic analyzer
has to work with a syntactic analyzer and produce input for a logical deductive system.

Grammars can be represented in logic form and parsing can be seen as theorem proving or problem
solving. Syntactic and semantic knowledge can be represented with a single logic formalism. Attempts
have been made to do the opposite and represent everything by a grammar (Woods 1976, pragmatic
grammars). This can be used in an architecture having the scheme shown in Figure 4.

Figure 4 – Architecture with an integrated interpretation knowledge source: ASR and interpretation share the AM, LM and linguistic KSs; signs, word lattices, concept structures and MRL descriptions are written into a short-term memory used by the dialogue manager and for learning.

In [127] a best-first parser is used. Its results trigger activations in a partitioned semantic network with which inferences and predictions are performed by spreading node activation through links. Tree Adjoining Grammars (TAG) also integrate syntax and logic form (LF) semantics [114].
Classification-based parsing may use Functional Unification Grammars (FUG), Systemic Grammars (SG), or Head-Driven Phrase Structure Grammars (HDPSG), which are declarative representations of grammars with logical constraints stated in terms of features and category structure. Semantics may also drive the parser, causing it to make attachments in the parse tree. Semantics can resolve ambiguities and translate English words into semantic symbols using a discriminant net for disambiguation. An interesting example of interleaving syntax and semantics in a parser is proposed in [25].

Semantic parsing is discussed in [144]. A semantics-first parser is described in [143]. Simple grammars are used for detecting possible clauses; then classification-based parsing completes the analysis with inference [51].

4. PARTIAL PARSING AND FALLBACK FOR SLU

Early experiments in SLU made clear the necessity of analyzing portions of a sentence when the complete sentence cannot be analyzed. Problems of this type may be due to the fact that spoken language very often does not follow a formal grammar, that hesitations and repetitions are frequent, that available parsers do not ensure full coverage of possible sentences even in the case of written text, and that ASR systems make errors.

In [136], ATNGs were proposed to interpret parts of a sentence using a middle-out analysis of the input words. A scope specification is associated with grammar actions. Parsing can proceed to the left or to the right of the input word. The scope specification indicates a set of states the parser has to have passed through before the action can be safely performed. If this is not the case, the action is delayed.

Another approach to avoiding parsing an entire sentence under the control of a single grammar consists in using specific TNGs for each frame slot, as in the Phoenix system [139]. In early versions of the system, the input to Phoenix was the top hypothesis of the speech recognition component. Subsequently [133], a search algorithm was implemented in which information from the TNG slot parsers was employed during the A* portion of the recognizer. Adopting the more conventional approach, in which the natural language component rescores a set of N-best hypotheses generated with standard N-gram language models, did not yield better recognition performance. On the other hand, it did yield significant improvement in understanding performance. The score for a frame was simply the number of words in an utterance it accounted for, though certain non-content words were ignored.

In [112], it is proposed to relax parser constraints when a sentence parser fails. This will permit the
recovery of phrases and clauses that can be parsed. Fragments obtained in this way are then fused
together.

Other solutions for partial parsing were proposed using finite state grammars. As stochastic versions of
them were developed, they will be reviewed later on.

More complex systems using fallback were proposed. They are described in some detail in ([22], ch. 14) and briefly reviewed in the following.
The Delphi system [10] contains a number of levels, namely syntactic (using Definite Clause Grammar, DCG), general semantics, domain semantics and action. Various translations are performed using links between representations at the various levels. DCG rules have LHS and RHS elements with an associated functor (their major category) and zero or more features in a fixed-arity positional order. Features are slots that can be filled by terms. Terms can be variables or functional terms. Semantic representation is based on frames. A grammatical relation has a component that triggers a translation relation. Binding operates on the semantic interpretation of the arguments to produce the semantic interpretation of a new phrase. In this way semantic fragments are built. DELPHI contains a linguistic analyzer that generates the N best hypotheses using a fast, simple algorithm, and then repeatedly rescores these hypotheses by means of more complex, slower algorithms. In this manner, several different knowledge sources can contribute to the final result without complicating the control structure or significantly slowing down derivation of the final result.

The first version of DELPHI used a chart-based unification parser (Austin et al., 1991). An important and useful feature of this parser, which has been retained in all subsequent versions, was the incorporation of probabilities for different senses of a word and for the application of grammatical rules. These probabilities are estimated from data and used to reduce the search space for parsing.

A robust fallback module has been incorporated in successive versions. The fallback understanding
module within DELPHI was called if the unification chart parser failed. Rather than employing the
semantic module to assign an explicit natural-language score to hypotheses, DELPHI tried to parse the
first N=10 hypotheses completely, stopping when a complete interpretation could be generated. If that
didn't work, another pass through these ten hypotheses would be made with the fallback module, which
tried to generate a robust interpretation from parsed fragments left over from the first, failed parse.

The fallback module was itself made up of two parts: the Syntactic Combiner and the Frame
Combiner. The Syntactic Combiner used extended grammatical rules that skipped over intervening
material in an attempt to generate a complete parse. If the attempt failed, the Frame Combiner tried to fill
slots in frames in a manner similar to that of SRI's Template Matcher. The Frame Combiner used many
pragmatic rules obtained through study of training data which could not be defended on abstract grounds.
For instance, interpretations which combine flight and ground transportation information are ruled out
because they are never observed in the data, even though a query like ”Show flights to airports with
limousine service” is theoretically possible.

Surprisingly, the fallback module worked better if only the Frame Combiner - but not the Syntactic
Combiner - was included.

In order to increase robustness and reduce reliance on the fallback module, a semantic graph data
structure was introduced and syntactic evidence was considered only one way of determining the
semantic links out of which the graph is built. A semantic graph is a directed acyclic graph in which
nodes correspond to meanings of head words (e.g. arrival, flight, Boston) and the arcs are binary
semantic relations. The basic parsing operation is that of linking two disconnected graphs with a new arc.

If the chart parser does not succeed in connecting such disconnected graphs, the Semantic Linker is invoked. Fragments are lexical nodes, combination is graph completion through search, and link probabilities are derived from a corpus. This component can ignore fragment order, skip over unanalyzable material, and even "hallucinate" a new node if that is the only way to link fragments.

Semantically driven parsers use pattern matching to recognize items that match with the data. Matching
may start with lexico-semantic patterns for instantiating initial lexical items. Interpretations are built by
adding non-lexical items inferred by a search algorithm. Semantic labels can be attached to parse tree
nodes as a result of the application of a rule whose premise matches words, syntactic categories and
available or expected semantic labels.

Different types of matchers can be designed for different purposes. When the purpose is solely retrieval, a
vector of features may adequately represent the content of a message.

Different structures are required if the purpose is that of obtaining a conceptual representation to be used
for data-base access or for a dialogue whose goal is the execution of an action. Finite state pattern
matchers, lexical pattern matchers, and sentence level pattern matchers are discussed in (Hobbs and
Israel, 1994).

Unanticipated expressions and difficult constructions in spoken language cause problems for a conventional approach. A few types of information account for a very large proportion of utterances. A template matcher (TM) tries to build templates (four basic ones, instantiated in various ways).

The system developed at Stanford Research Institute (SRI) consists of two semantic modules yoked together: a unification-grammar-based module called "Gemini", and the TM, which acts as a fallback if Gemini cannot produce an acceptable database query. Gemini is a unification-based natural-language parser that combines general syntactic and semantic rules for English with an ATIS-specific lexicon and sortal/selectional restrictions.

Templates have slots filled by looking for short phrases produced by the recognizer even if not all the
words have been hypothesized correctly. Instantiation scores are basically the percent of correct words.
The template with the best score is used to build the query.

The input to the TM is the top word sequence hypothesis generated by the speech recognition
component, which uses a bigram language model. The TM simply tries to fill slots in frame-like
templates. An early version had just 8 templates dealing with flights, fares, ground transportation,
meanings of codes and headings, aircraft, cities, airlines and airports. The different templates compete
with each other on each utterance; all are scored, and the template with the best score generates the
database query (provided its score is greater than a certain cut-off). Slots are filled by looking through the
utterance for certain phrases and words.

Here is a typical example. For the utterance "Show me all the Delta flights Denver to Atlanta nonstop on the twelve of April leaving after ten in the morning", the following flight template would be generated:

[flight, [stops, nonstop],
 [airline, DL],
 [origin, DENVER],
 [destination, ATLANTA],
 [departing_after, [1000]],
 [date, [april, 12, current_year]]
].

The score for a template is basically the percentage of words in the utterance that contribute to filling
the template.
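
The TM scoring rule can be sketched in a few lines (the word lists are illustrative):

def template_score(utterance_words, contributing_words):
    # Fraction of utterance words that contributed to filling the
    # template's slots (non-content words could be excluded upstream).
    contributing = set(contributing_words)
    hits = sum(1 for w in utterance_words if w in contributing)
    return hits / len(utterance_words)

# Illustrative: 4 of 6 words fill slots of a flight template.
print(template_score(
    ["show", "flights", "denver", "to", "atlanta", "nonstop"],
    ["flights", "denver", "atlanta", "nonstop"]))  # 0.666...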


More details and references can be found in ([22], ch. 14).

5. FINITE STATE PROBABILISTIC MODELS FOR INTERPRETATION

Even if there are relations between semantic and syntactic knowledge, integrating these two types of
knowledge into a single grammar formalism may not be the best solution. Many problems of
automatic interpretation in SLU systems arise from the fact that many sentences are ungrammatical,
the ASR components make errors in hypothesizing words and grammars have limited coverage. These
considerations suggest that it is worth considering specific models for each conceptual constituent.
In addition to partial parsing [51] and back-off, in the Air Travel Information System (ATIS) project,
it was found useful to conceive devices for representing knowledge whose imprecision is
characterized by probability distributions. It was also found useful to obtain model parameters by
automatic learning using manually annotated corpora. This works as long as manual annotation is easy and reliable and ensures high coverage.

The SLU system architecture evolved according to the scheme shown in Figure 5. Figure 6 shows the types of KSs mostly used in the 1990s.

The modular architecture is based on knowledge and automatic learning capability. Different types of
SLU knowledge are reviewed in this and the following sections.
Stochastic finite-state approximations of natural language knowledge are practically useful for this
purpose. Finite-state approximations of context-free grammars are proposed in [89]. Approximations of
TAG grammars are described in [97]. A review of these approximations is provided in [28].
Let us assume that a concept C is expressed by a user in a sentence W which is recognized by an ASR system based on acoustic features Y. This can be represented as follows: Y →e W →e C. The symbol →e indicates an evidential relation, meaning that if Y is observed then there is evidence of W and, because of this, there is evidence of C.
There are exceptions to this chain of rules, because a different concept C′ can be expressed by W, and Y can generate other hypotheses W′ which express other concepts. Furthermore, C can be expressed by other sentences W_j which can be hypothesized from Y. The presence of C in a spoken message described by Y can only be asserted with probability:
P(C | Y) ≈ (1 / P(Y)) Σ_j P(Y | W_j) P(C, W_j)
Let us now assume that C is a sequence of hypotheses about semantic constituents. The following decision strategy can be used to find the most likely sequence C′:

C′ = argmax_C P(C | Y) = argmax_C P(Y | W) P(C, W)

Word hypotheses are generated by an ASR system using a probabilistic language model (LM).

Figure 5 – Architecture with translation and structural KSs: a speech-to-MRL-constituents module (using the AM, LM and translation KSs) is followed by a constituents-to-structures module (using structural KSs); signs, words, concept tags and concept structures are written into a short-term memory used by the dialogue manager and for learning.

Figure 6 – Architecture with stochastic knowledge and automatic learning capability: SFSMs, n-grams and SCFGs are learned from a corpus and used, together with matchers, parsers and translators, to map words and lattices to the meaning under the control of the ASR and SLU KSs.

A solution based on the concepts introduced above is implemented in the system called Chronus [90]. The core of this system is a stochastic model whose parameters are learned from a corpus in which semantic constituents are associated with sentence chunks. The conceptual decoder at the core of Chronus is based on a view of utterances as generated by an HMM-like process whose hidden states correspond to meaning units called concepts. Thus, understanding is a decoding of these concepts hidden in an utterance. In the Chronus system, the joint probability P(C, W) is computed as follows:

P(C, W) = P(W | C) P(C)

P(C) is obtained with concept bigram probabilities.

A version of Chronus obtained the best score on the 1994 natural language (NL) benchmark. It was based
on the following principles:
• locality - the analysis of the entire sentence is delayed as long as possible,
• learnability - everything that can be learned automatically from data should be,
• patchability - it should be easy to introduce new knowledge into the system,
• separation - among algorithms, and between general and specific knowledge,
• habitability - the focus should be on robustness to unexpected non-linguistic
phenomena and recognizer mistakes, rather than on dealing with rare, complex linguistic events.

The success of this system is in some respects surprising, given that the conceptual decoder chops an
utterance up into non-overlapping segments, which to a first approximation are considered to contribute
to the meaning independently of each other (interactions are handled by the ”interpreter”, a small hand-
coded module, at a later stage of processing). A later version of Chronus has four main modules: the
lexical analyzer, the conceptual decoder, the template generator, and the interpreter. The input to the
lexical analyzer is the top hypothesis generated by the recognizer. The lexical analyzer recognizes
predefined semantic categories, which group together all possible idiomatic variants of the same word or
fixed phrase: for instance, ”JFK”, ”Kennedy Airport”, ”Kennedy International Airport”, ”New York City
International Airport” are all assigned to the same semantic category. The lexical analyzer also groups
together singular and plural forms of a word, and inflectional variants of a verb, thus achieving
robustness to minor speech recognition errors.

The conceptual decoder views the modified word sequences emerging from the lexical analyzer as
conceptual Hidden Markov Models (conceptual HMMs), with the words being the observations and the
concepts being the states. Concept sequences are currently modeled via a bigram language model, and the
sequence of words within a concept is modeled as a concept-dependent N-gram language model. The
function of the conceptual decoder is to segment an utterance into phrases, each representing a concept.
This is equivalent to finding the most likely sequence of states in the conceptual HMM, given the
sequence produced by the lexical analyzer.
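
A conceptual-HMM segmenter of this kind reduces to a standard Viterbi recursion; the sketch below uses concept-dependent unigram emission models for brevity, and all concept names and probability tables are illustrative placeholders, not the Chronus parameters:

import math

def viterbi_concepts(words, concepts, p_trans, p_emit):
    # Most likely concept (state) sequence for a word sequence.
    # p_trans[c1][c2]: concept bigram probability ("<s>" is the start symbol);
    # p_emit[c][w]: word probability under the concept-dependent LM.
    V = [{c: math.log(p_trans["<s>"][c])
             + math.log(p_emit[c].get(words[0], 1e-9)) for c in concepts}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for c in concepts:
            prev = max(concepts, key=lambda p: V[t - 1][p] + math.log(p_trans[p][c]))
            V[t][c] = (V[t - 1][prev] + math.log(p_trans[prev][c])
                       + math.log(p_emit[c].get(words[t], 1e-9)))
            back[t][c] = prev
    last = max(concepts, key=lambda c: V[-1][c])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Illustrative tables with two concepts.
concepts = ["origin", "destin"]
p_trans = {"<s>": {"origin": 0.5, "destin": 0.5},
           "origin": {"origin": 0.7, "destin": 0.3},
           "destin": {"origin": 0.3, "destin": 0.7}}
p_emit = {"origin": {"from": 0.4, "boston": 0.3},
          "destin": {"to": 0.4, "denver": 0.3}}
print(viterbi_concepts(["from", "boston", "to", "denver"],
                       concepts, p_trans, p_emit))
# ['origin', 'origin', 'destin', 'destin']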

The choice of conceptual units is a domain-dependent design decision. For ATIS, some concepts relate
directly to database entities (e.g., ”destination”, ”origin”, ”aircraft_type”) and others are more linguistic
(e.g., ”question”, ”dummy” - for irrelevant words, and ”subject” - what the user wants to know). Once
these units have been defined, the parameters of the conceptual HMM must be estimated from a training
corpus of segmented, labeled word sequences by means of the Viterbi training algorithm for HMMs.

This process can be bootstrapped. A typical output from the conceptual decoder might look like this:
wish : I WOULD LIKE TO GO
origin : FROM NEW YORK
destin : TO SAN FRANCISCO
day : SATURDAY
time : MORNING
aircraft : PREFERABLY ON A BOEING SEVEN FORTY SEVEN

The following frame is then obtained by hand-written rules:

ORIGIN_CITY : NNYC
DESTINATION_CITY : SSFO
WEEKDAY : SATURDAY
ORIGIN_TIME : 0<1200
AIRCRAFT : 74M

More details and references can be found in [22]. Examples of learning algorithms for finite state transducers can be found in [91].
The CHANEL system [60] performs the following computation:

P(C, W) = P(C | W) P(W)

CHANEL learns semantic interpretation rules by means of a forest of specialized decision trees called
Semantic Classification Trees (SCTs). The required annotation only consists in listing the concepts
present in a sentence.

There is an SCT for every elementary concept. An SCT is a binary tree with a question associated with each node. Questions are generated and selected automatically. Each node has two successors: one is reached if the answer to the node question is YES, the other if the answer is NO. Questions are about sentence patterns made of words and wildcard symbols (+). If the node pattern matches the sentence to be interpreted, then the answer to the node question is YES and the successor node pointed to by the arc labeled YES is considered; otherwise the answer is NO and the corresponding successor node is considered. Figure 7 shows an example of a tree for the concept fare. An example of a question pattern is <+ M(fare) +>, which matches a sentence containing any word member of the set M(fare) of words expressing the same meaning as fare.
Figure 7 – Example of an SCT for the concept fare: node questions such as <+ fare +>, <+ M(fare) +>, <+ fare code +>, <flights + fare +> and <+ cost +> lead through YES/NO arcs to Y/N leaves or subtrees.

The nature of the questions in the SCTs is such that the rules learnt are robust to grammatical and
lexical errors in the input from the recognizer. In fact, these questions are generated in a manner that
tends to minimize the number of words that must be correct for understanding to take place. Question
generation involves ”gaps”: words and groups of words that can be ignored. Thus, each leaf of an SCT
corresponds to a regular expression containing gaps, words, and syntactic units (e.g., times, dates,
airplane types). Most SCTs in CHANEL decide whether a given concept is present or absent from the
semantic representation for an utterance; for such SCTs, the label Y or N in a leaf denotes the presence
or absence of the corresponding concept.
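
An SCT of this kind reduces to nested pattern tests over the (preprocessed) utterance; a sketch, with the node questions written directly as regular expressions and a tree fragment following Figure 7 (the exact structure is illustrative):

import re

class SCTNode:
    # Binary decision node of a Semantic Classification Tree: a word
    # pattern (the node question) with YES/NO successors; leaves carry
    # the Y/N decision for the concept.
    def __init__(self, pattern=None, yes=None, no=None, leaf=None):
        self.regex = re.compile(pattern) if pattern else None
        self.yes, self.no, self.leaf = yes, no, leaf

    def decide(self, sentence):
        if self.leaf is not None:
            return self.leaf
        branch = self.yes if self.regex.search(sentence) else self.no
        return branch.decide(sentence)

# Fragment of the "fare" tree of Figure 7; the patterns stand for the
# "+ fare +"-style questions, with surrounding material ignored ("gaps").
fare_tree = SCTNode(
    pattern=r"\bfare\b",
    yes=SCTNode(pattern=r"\bfare code\b",
                yes=SCTNode(leaf="N"), no=SCTNode(leaf="Y")),
    no=SCTNode(pattern=r"\bcost\b",
               yes=SCTNode(leaf="Y"), no=SCTNode(leaf="N")),
)
print(fare_tree.decide("show me the fare for this flight"))  # Y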
If one generalizes away from the domain-specific details of CHANEL, one can give the following
recipe for building a CHANEL-like system.

1. Collect a corpus of utterances in which each utterance is accompanied by its semantic representation.

2. Write a local parser that recognizes semantically important noun phrases that encode variables in the semantic representation (e.g., times, locations) and replaces such phrases with a generic code (while retaining a value for each variable). For instance, a time might be replaced by the symbol TIME, and a city name by the symbol CITY. Thus, the utterance "give me all uh ten at night flights out of Boston" might become "give me all uh TIME[22:00] flights out of CITY[Bos]".

3. Devise a way of mapping the rest of the semantic representation (i.e., the part that does not consist of the variables just mentioned) into a vector of N bits. For example, CHANEL had a "fare" bit that was set to 1 if the user wanted to know the cost of a flight, and to 0 otherwise. Some bits are allocated to deciding the role of variables, e.g., to deciding whether the CITY in "give me all uh TIME flights out of CITY" should be an origin or a destination.

4. Grow N SCTs, one for each position in the bit vector. The training data for each SCT is the whole training corpus of utterances after processing by the local parser (with variable values stripped out); the label for each utterance is the value of the appropriate bit. E.g., for CHANEL two typical training utterances for the fare SCT might be: "give me all uh TIME flights out of CITY" => 0 and "how much are flights to CITY these days" => 1.

5. Given a new utterance, one can generate a semantic representation from the resulting system as follows:
• pass the utterance through the local parser,
• temporarily strip out variable values (saving them for later use) and submit the resulting string to the N SCTs (each SCT receives a complete copy),
• the resulting vector of bits, together with saved variable values, gives a unique semantic representation for the utterance.

The probability P(C|W) is obtained from the number of times the leaf corresponding to the pattern that matched W is reached. Notice that W can be an entire sentence. Different concept tag hypotheses can be generated by different sentence patterns that share some components.

Good parsers for semantically important noun phrases can be hand-coded quite quickly; implementing machine learning of the rules in these parsers would have been more trouble than it was worth and was avoided.
The following is an example of the semantic representation generated by CHANEL:

DISPLAYED_ATTRIBUTES (flights, fares)
CONSTRAINTS (flight_from_airport <- BBOS,
             flight_to_airport <- DDEN,
             flight_departure_time <- 10.00AM).

In the CHANEL architecture, elementary concepts are detected by SCTs; they are then composed by rules into an MRL description and translated into a query to a database in the SQL language.

The most interesting aspect of CHANEL is that the inference carried out by the SCTs explicitly models "don't care" words, allowing the system to tolerate a high degree of misrecognition in semantically unimportant words. The literature on grammatical inference (Fu 1982) focuses on production rather than comprehension, and thus implicitly assumes that the goal is to learn rules that account for all the symbols in a given string. SCTs instead try to discover rules involving as few words or syntactic units as possible.

There is another important difference between CHANEL and systems such as AT&T's Chronus.
Chronus carries out a one-to-one mapping between sentence segments and concepts, while each of the
SCTs in CHANEL builds part of the semantic representation, and looks at the entire utterance in order
to do so. This permits a given word or phrase to contribute to more than one concept, and also permits
words or phrases that are far apart from each other to contribute to the same concept.
In [27], it is proposed to extract concept hypotheses from a word lattice. Each concept hypothesis is
extracted with a probabilistic conceptual semantic context-free grammar.

6. CONCEPTUAL LANGUAGE MODELS

Specific conceptual language models can be used in ASR decoding to obtain concept hypotheses directly from the signal rather than from word hypotheses.

In [149] it is observed that the probabilities of finite-state language models can be adapted to expectations set by the history of already uttered words or by the states of a spoken dialogue.

A dialogue can be seen as a sequence of turns. Let us assume that, at turn τ, the dialogue is in a state S_τ. S_τ is determined by the values of the variables which are bound at turn τ. Specialized dialogue-state-dependent LMs are considered. They are obtained by adapting a generic LM to a specific situation in which constituent concepts have a certain probability of appearing in a future state S_{τ+1}.

Let P_S(w_i | W_{i−n+1}^{i−1}) be the n-gram probability distribution for an LM dependent on state S = S_τ. Let P_j(w_i | W_{i−n+1}^{i−1}) be the probability distribution of a conceptual LM corresponding to the semantic constituent c_j. Let P_S(c_j) be the probability that concept c_j is expressed by a subject speaking to a system in state S. Then:

P_S(w_i | W_{i−n+1}^{i−1}) = Σ_{j=1}^{J} P_S(c_j) P_j(w_i | W_{i−n+1}^{i−1})

Assume now that, at turn τ, the dialogue can be in different states S_l with probability P(S_l), and that at turn τ+1 the dialogue can be in different states S_k. For each transition S_l → S_k it is possible to estimate the probability P_{l,k}(c_j) that a semantic constituent is expressed during that transition. It is possible to conceive a dynamic LM with the following probability distribution at turn τ:

P_τ(w_i | W_{i−n+1}^{i−1}) = Σ_{l=1}^{L} Σ_{k=1}^{K} Σ_{j=1}^{J} P_{l,k}(c_j) P_j(w_i | W_{i−n+1}^{i−1})

Based on the future state S_k and the action a_τ executed by the system at turn τ, it is possible to identify the concepts that are more likely to be expressed by the user response. Based on this, the following approximations can be considered for the probability P_{l,k}(c_j):

P_{l,k}(c_j) ≈ P(S_k | a_τ, S_l) P(a_τ | S_l) P(S_l) ≈ P(S_k | S_l) P(S_l)
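
The state-dependent distribution above is a weighted interpolation of concept-specific LMs; a minimal sketch (the probability tables and the two concept LMs are illustrative, not from the survey):

def state_lm_prob(word, history, concept_priors, concept_lms):
    # P_S(w_i | history) = sum_j P_S(c_j) * P_j(w_i | history):
    # interpolate concept-dependent LMs with state-dependent concept priors.
    return sum(p_c * lm(word, history)
               for p_c, lm in zip(concept_priors, concept_lms))

# Illustrative bigram-style concept LMs (e.g. "origin" and "destination")
# and the prior over concepts expected in the current dialogue state S.
lm_origin = lambda w, h: 0.3 if (h, w) == ("from", "boston") else 0.01
lm_destin = lambda w, h: 0.3 if (h, w) == ("to", "boston") else 0.01
print(state_lm_prob("boston", "to", [0.4, 0.6], [lm_origin, lm_destin]))  # 0.184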

Another possibility is to have a generic n-gram LM and specific stochastic finite-state machines (FSMs), one for each semantic constituent c_j. An example of LMs based on stochastic FSMs can be found in [95]. Variable N-gram Stochastic Automata (VNSA) and their use for hypothesizing semantic constituents are proposed in [101].

In [83], weighted finite state machines (WFSM) are proposed whose edges are labelled with words. A path in the WFSM represents a phrase. Word n-grams and WFSMs can be combined and regarded as a hidden Markov model (HMM). The model construction starts with sentence parsing. The first step of the construction consists in partially parsing a training corpus in order to recognize sequences of words as phrases. The training corpus is first annotated with part-of-speech (POS) tags. At the end of this process, each word of the corpus is associated with its most probable part of speech. The annotated corpus is then partially parsed using a greedy finite-state partial parser. The parser gathers together adjacent words composing a phrase of a given type (noun phrase, verb phrase, ...). Different grammars are used to recognize phrases of different nature and length. The second step is the construction of phrase classes. It consists in grouping together into classes phrases of the same category. The third step consists in merging together classes having a close internal distribution. An example of WFSM is shown in Figure 8.

Figure 8 – Example of WFSM.

Finite state models can be made more robust by modifying the original topology to take into account
possible insertions, deletions and substitutions. Insertion of words not essential for characterizing a
semantic constituent can be modeled by groups of syllables [21] as shown in Figure 9.

Recent advances in research on stochastic FSMs made it possible to generate a probabilistic lattice of
conceptual constituent hypotheses from a probabilistic lattice of word hypotheses.
In [99], a stochastic finite-state conceptual language model CLM_j is conceived for every semantic
constituent. An initial ASR activity uses a generic LM, indicated as GENLM, for generating a graph
WG of word hypotheses. An automaton AWG is derived for this graph. A sequence W of word
hypotheses is scored by its likelihood.

Figure 9 – Example of error modeling in FSMs.

A knowledge source is built by connecting all the CLM_j in parallel. Such a knowledge source is
composed with WG, leading to an automaton SEMG:

SEMG = WG ∘ (⋃_{c=0}^{C} CLM_c)

where the operator ∘ indicates composition and CLM_0 is a generic model. A network implementing the
union of CLMs is shown in Figure 10.

Figure 10 – Network implementing the union of CLMs: the generic model CLM_0 and the concept
models CLM_1, ..., CLM_J are connected in parallel.


Arcs of SEMG are labelled by pairs of symbols. The first symbol of the pair is a word w with its
associated likelihood. The second symbol of the pair can be the empty symbol, the beginning of a semantic tag
or the end of a semantic tag. A semantic tag represents any semantic constituent or structure for which a
relation with a word pattern has been identified.
The support of a concept c_j, sup(c_j), is the union of all the paths going from the beginning to the end of
WG. Supports for different concepts can overlap. Figure 11 shows an example of WG.

Figure 11 – Example of WG

In order to obtain the concept tags representing hypotheses that are more likely to be expressed by the
analyzed utterance, SEMG is projected on its outputs, leading to a weighted finite-state machine (FSM)
containing only indicators of the beginning and end words of semantic tags. The resulting FSM is then
made deterministic and minimized, leading to an FSM SWG given by:

SWG = OUTPROJ(SEMG)

where OUTPROJ represents the operation of projection on the outputs followed by determinization and
minimization.
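
A toy illustration may make the projection step more concrete. Real systems apply composition, determinization and minimization with an FSM toolkit; the arc encoding below is purely an assumption for illustration:

```python
# Toy sketch of projecting SEMG onto its output side: each arc carries a word
# on the input side and a semantic tag boundary marker (or epsilon) on the
# output side. Projection keeps only the output labels; determinization and
# minimization, normally done by an FSM toolkit, are omitted here.
EPS = None

def output_project(arcs):
    """(src, dst, word, tag_marker, weight) -> (src, dst, tag_marker, weight)."""
    return [(src, dst, out, w) for (src, dst, _word, out, w) in arcs]

semg = [
    (0, 1, "flight", EPS, 0.10),
    (1, 2, "to", "<DEST>", 0.20),       # beginning of a semantic tag
    (2, 3, "Boston", "</DEST>", 0.05),  # end of the semantic tag
]
swg = output_project(semg)  # retains only tag boundary information
```
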
A network of conceptual LMs has been used directly in the ASR decoding process [21]. In this way, the
whole ASR knowledge models a relation between signal features and meaning.

Conceptual hypotheses in the lattice obtained by this projection can be further processed for performing
semantic composition and inference. The objective is to have the correct interpretation as the most likely
among the hypotheses coherent with the dialogue predictions.

Generation of lattices of conceptual constituents (tags) from lattices of words follows the scheme shown
in figure 12.
Figure 12 – Architecture for the generation of lattices of conceptual constituents (tags) from lattices of
words: an ASR module, using acoustic models (AM) and an LM, produces a word lattice; a translation
module, using a translation knowledge source (tr KS), converts it into a tag lattice.

In [53], an automaton extracts key phrases from continuous speech and converts them to commands for
a multi-modal interaction with a virtual fitting room. Finite-state LMs for interpretation are discussed in
[92]. Interesting results can also be found in [138]. The integration of semantic predictors in statistical LMs
is proposed in [16].

LMs based on Latent Semantic Analysis (LSA) capture some semantic relationships between words.
LSA maps words and histories into a semantic space using the Singular Value Decomposition (SVD)
technique [6]. Word similarities are measured with distance metrics such as the inner product between
vectors. A similar technique was proposed for hypothesizing semantic components in a sentence [15].
A solution with which relevant improvements were observed in large-corpora experiments is proposed
in [130]. Super abstract role values (superARVs) are introduced to encode multiple knowledge sources in a
uniform representation that is much more fine-grained than parts of speech (POS).
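
A minimal numeric illustration of the LSA mapping may help: SVD factorizes a word-by-document count matrix, and similarities are inner products in the reduced space. The tiny matrix and vocabulary below are invented for illustration:

```python
import numpy as np

# Tiny LSA sketch: rows are words, columns are documents. SVD maps words into
# a rank-k semantic space; similarity is the inner product between word
# vectors. Counts and vocabulary are illustrative assumptions.
vocab = ["flight", "airline", "stock"]
counts = np.array([[2.0, 0.0, 1.0, 0.0],
                   [1.0, 0.0, 2.0, 0.0],
                   [0.0, 3.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]            # word coordinates in the semantic space

def similarity(i, j):
    return float(word_vecs[i] @ word_vecs[j])

# "flight" and "airline" share contexts, so their inner product dominates:
print(similarity(0, 1), similarity(0, 2))
```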

In [121], a hierarchy of LMs is proposed for interpretation. The introduction of three new ways to use
semantic information in LMs is presented in [28].


In the first approach, finite-state models are used to obtain a concept LM score which is interpolated with the
n-gram LM score. In the second approach, semantic parse information is combined with n-gram information
using a two-level statistical model. In the third approach, features are used for computing the joint probability of
a sentence and its parse with a single maximum entropy (ME) model.

Multiple features are combined as follows to obtain the probability of a word w in a vocabulary V, given
its history h:

P(w | h) = \frac{e^{\sum_i \lambda_i f_i(w, h)}}{\sum_{v \in V} e^{\sum_i \lambda_i f_i(v, h)}}
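
A direct reading of this formula can be sketched in a few lines; the feature functions and weights below are illustrative assumptions, and a practical implementation would cache scores and work in log space:

```python
import math

# Sketch of the maximum entropy word distribution above: exponentiate the
# weighted feature sum for the candidate word and normalize over the
# vocabulary V. Feature functions and weights are toy assumptions.

def me_word_prob(w, h, vocab, features, lambdas):
    def score(v):
        return math.exp(sum(lam * f(v, h) for f, lam in zip(features, lambdas)))
    z = sum(score(v) for v in vocab)     # normalization constant
    return score(w) / z

# Example: one binary feature firing when the history ends with "to"
features = [lambda v, h: 1.0 if h and h[-1] == "to" and v == "Boston" else 0.0]
print(me_word_prob("Boston", ["fly", "to"], ["Boston", "Dallas"], features, [1.5]))
```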

7. STOCHASTIC GRAMMARS FOR INTERPRETATION

The rules of a grammar assert the truth of a non-terminal symbol given the truth of other terminal and
non-terminal symbols. The assertion of the presence of a semantic constituent or compound also
depends on the assertion of syntactic structures and words. It is thus possible, in principle, to introduce
non-terminal symbols which represent semantic entities into a natural language grammar and hypothesize
their presence in a sentence with a parsing strategy. Grammars of this type should be context-sensitive
and parsing strategies should provide inference capabilities. Nevertheless, context-free grammars or
grammars capable of representing certain degrees of context-sensitivity may be adequate for a group of
applications. Furthermore, the development of new types of grammars and parsers capable of taking into
account imprecision made it attractive to integrate syntactic and semantic knowledge into stochastic
semantic grammars. Grammars may capture relations between words and semantic constituents as well
as knowledge for composing constituents into structures. These grammars can be augmented to contain
structure-building knowledge and perform logic operations.

Stochastic context-free grammars (SCFG) can generate sentences of any length. Parsing these sentences
is an activity that involves the application of a finite number of rules. Sequences of their application can
be modelled by a finite state structure and the history of the rules applied before a given rule can be
summarized by finite feature sets. Sequences of rule applications and their probabilities are considered
in history grammars [8], making them a more accurate probabilistic LM.

For SLU, the linguistic analyzer TINA was proposed. It is written as a set of probabilistic context free
rewrite rules with constraints, which is converted automatically at run-time to a network form in which
each node represents a syntactic or semantic category [111]. The probabilities associated with rules are
calculated from training data, and serve to constrain search during recognition (without them, all
possible parses would have to be considered).

A robust matcher was obtained by modifying the grammar to allow partial parses [112]. In robust
mode, the parser proceeds left-to-right as usual, but an exhaustive set of possible parses is generated
starting at each word of the utterance.

The Hidden Understanding Model (HUM) is inspired by (but not formally equivalent to) hidden
Markov models [76]. In the HUM system, after a parse tree is obtained, bigram probabilities of a partial
path towards the root, given another partial path, are used. Interpretation is guided by instructions
represented by a stochastic decision tree.

Let M be the meaning of an utterance, represented by one or more semantic structures, and let W be
the sequence of words that convey this meaning. Hypotheses are scored by the following probability:

Pr(M|W) = Pr(W|M) Pr(M) / Pr(W)
For given W, the M that maximizes Pr(M|W) can be found by maximizing Pr(W|M)Pr(M), since
Pr(W) is fixed. Pr(M) can be estimated from a semantic language model that specifies how meaning
expressions are generated stochastically; Pr(W|M) can be estimated from a lexical realization model
that specifies how words are generated, given a meaning. The semantic language model employs tree
structured meaning representations: concepts are represented as nodes in a tree, with sub-concepts
represented as child nodes. Each terminal node is the parent of a word or of a sequence of words. Note that unlike Chronus,
HUM allows arbitrary nesting of concepts.

For instance, the concept FLIGHT has as possible sub-concepts AIRLINE, FLIGHT_NUMBER,
ORIGIN, and DESTINATION. ORIGIN and DESTINATION have as possible children the terminal
nodes (respectively) ORIGIN_IND and CITY, and DEST_IND and CITY. In this tree structured
representation, the phrase "United flight 203 from Dallas to Atlanta" could be analyzed as:

FLIGHT [AIRLINE[United]
FLIGHT_IND[flight]
FLT_NUM[203]
ORIGIN[ORIGIN_IND[from] CITY[Dallas]]
DESTINATION[DEST_IND[to] CITY[Atlanta]] ]

The lexical realization model is a bigram language model augmented with information about the
current parent concept: Pr(word_i | word_{i-1}, concept).
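
The lexical realization model can be sketched directly from this definition. The alignment of words to parent concepts and the probability table below are hypothetical:

```python
import math

# Sketch of HUM's lexical realization model: a bigram LM conditioned on the
# parent concept of each word, Pr(word_i | word_{i-1}, concept). The nested
# probability table and word/concept alignment are toy assumptions.

def realization_logprob(words, parents, bigram, start="<s>"):
    logp, prev = 0.0, start
    for w, c in zip(words, parents):
        logp += math.log(bigram[c][(prev, w)])
        prev = w
    return logp

bigram = {"ORIGIN_IND": {("<s>", "from"): 0.3},
          "CITY": {("from", "Dallas"): 0.5}}
print(realization_logprob(["from", "Dallas"], ["ORIGIN_IND", "CITY"], bigram))
```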

Chart parsers were used to analyze portions of sentences in a middle-out strategy and to produce a
forest of sub-trees when the parser could not process an entire sentence. The problem of computing the
probability of a partial parse when a stochastic CFG is used was investigated in [19] and it was shown
that only upper-bounds for parse probabilities can be obtained.

Other examples on the use of semantic grammars can be found in [80]. Parsing word graphs is
proposed in [122].

Most grammars have hand-crafted rules which might then be augmented with corpus statistics. Parsing
with these grammars suffers from limited coverage.

At Cambridge University [42], an approach based on SCFGs was proposed which does not require fully
annotated data for training. The proposed solution considers a hidden vector state (HVS) model. Each
vector state is viewed as a hidden variable and represents the state of a push-down automaton. Such a
vector is the result of pushing non-terminal symbols starting from the root symbol and ending with the
pre-terminal symbol. Non-terminal symbols correspond to semantic compositions like FLIGHTS while
pre-terminal symbols correspond to semantic constituents like CITY.

An example of a state vector representing a path for a composition up to the start symbol S is:

[ CITY, FROM_LOCATION, FLIGHTS, S ]

where the stack grows from the root S at the bottom to the pre-terminal CITY on top.

Transitions between states can be factored into a stack shift followed by a push of one or more semantic
concepts relating to the next input word.

Probabilities are defined for the following parse actions:

1. popping semantic tags off the stack;
2. pushing a pre-terminal semantic tag onto the stack;
3. generating the next word.

Interpretation is performed by first generating a sequence W of word hypotheses for which the probability
P(Y,W) is maximum, then searching for interpretations C by maximizing P(C|W), and then searching for
dialog acts G by maximizing P(G|C). This decoder corresponds to a suboptimal solution. Once the N-best
values for C are found, they are rescored using P(C,Y).

If N is a sequence of stack pop operations, C is the sequence of conceptual tag vectors and W the
sequence of words, then the model computes P(W,C,N) as follows:

P(W, C, N) = \prod_{t=1}^{T} P(n_t | c_{t-1}[\cdot]) P(c_t[1] | c_t[2 \ldots D]) P(w_t | c_t[\cdot])

where c_t[·] is the t-th vector of semantic tags obtained by successively pushing and shifting semantic
tags.
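
The factorization can be mirrored in a short sketch that walks the vector state word by word; the three probability tables are hypothetical and must assign mass to every (event, stack) pair encountered:

```python
import math

# Sketch of the HVS factorization above: at each word, pop n_t tags from the
# vector state, score the push of a new pre-terminal tag, then score the word
# emission. All probability tables are toy assumptions.

def hvs_logprob(words, pops, pushes, p_pop, p_push, p_word):
    stack, logp = ["S"], 0.0
    for w_t, n_t, tag in zip(words, pops, pushes):
        logp += math.log(p_pop[(n_t, tuple(stack))])    # P(n_t | c_{t-1})
        if n_t:
            del stack[-n_t:]                            # stack shift
        logp += math.log(p_push[(tag, tuple(stack))])   # P(c_t[1] | c_t[2..D])
        stack.append(tag)
        logp += math.log(p_word[(w_t, tuple(stack))])   # P(w_t | c_t)
    return logp
```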

Training is performed by annotating keywords into classes using Bayesian networks and providing
examples of utterances which would yield each type of semantic schemata. Furthermore, the domain
specific lexical classes and abstract semantic annotation for each utterance are provided.

In [129], it is observed that the remarkable robustness exhibited by the auditory system may be
attributed to the use of a detection-based mechanism. A new formulation is proposed that performs
concept hypothesization in conjunction with ASR decoding under the control of an SCFG.

The combination of semantic and syntactic structures in LMs is proposed in [11]. Lexicalized stochastic
grammars and head-driven statistical parsers are presented in [14,18]. Partial parses are proposed in
[13,106] to enhance robustness. They use a top-down strategy, conditioning word prediction on
previously hypothesized structures. Several learning systems have been developed for semantic parsing.
These systems use supervised learning methods which only utilize annotated sentences.

In [52], a semi-supervised learning system for semantic parsing using a support vector machine (SVM)
is described. Given positive and negative training examples in some vector space, an SVM finds the
maximum-margin hyperplane which separates them. When new unlabeled test examples are also
available during training, a transductive framework for learning uses them to adapt the SVM
classifier.
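
The transductive idea can be pictured with an ordinary SVM plus a self-training loop over the unlabeled examples; this is a simplified stand-in for a true transductive SVM, not the exact method of [52], and the data and thresholds are synthetic assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Simplified stand-in for a transductive framework: an SVM is trained on
# labeled data, then unlabeled test vectors lying confidently outside the
# margin band are self-labeled and folded back into training.
rng = np.random.default_rng(0)
X_lab = rng.normal([[0, 0]] * 20 + [[3, 3]] * 20)
y_lab = np.array([0] * 20 + [1] * 20)
X_unl = rng.normal([[0.2, 0.2]] * 30 + [[2.8, 2.8]] * 30)

svm = SVC(kernel="linear").fit(X_lab, y_lab)
for _ in range(3):                              # a few adaptation rounds
    margin = np.abs(svm.decision_function(X_unl))
    keep = margin > 1.0                         # confident unlabeled points
    X_aug = np.vstack([X_lab, X_unl[keep]])
    y_aug = np.concatenate([y_lab, svm.predict(X_unl[keep])])
    svm = SVC(kernel="linear").fit(X_aug, y_aug)
```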

8. SEMANTIC SYNTAX-DIRECTED TRANSLATION

It is possible to see semantic interpretation as a translation process from natural language sentences
into MRL phrases. A syntax-directed translation schema (SDTS) is a five-tuple T = [VN, VT1, VT2, R,
S], where S is the start symbol, VN is the set of nonterminal symbols, VT1 is the set of input words,
VT2 is the set of semantic primitives, and R is the set of rules for rewriting nonterminal symbols, of the
type:

A → αβ   where α ∈ (VN ∪ VT1)*, A ∈ VN and β ∈ (VN ∪ VT2)*.

With these rules, sentences and the corresponding semantic descriptions can be generated. It is also
possible to generate semantic descriptions during parsing of input sentences. Translations can be
scored by probabilities in a stochastic SDTS. In this case, the syntax and semantic generating rules
have associated probabilities with which it is possible to compute the probability that semantic rules
generate the abstract tree of an interpretation given the abstract tree of the syntactic analysis of a
sentence.

Rules for an SDTS can be built manually, or they can be learned with their probabilities by grammatical
inference techniques from a corpus. As the rules do not allow generating all the possible observable
sentences, the coverage of these grammars is only partial. Interpretation of sentences that cannot be
generated by the rules is made possible by performing error-correcting parsing.

A good introduction to the formalism and its application to pattern recognition can be found in [151];
an application to speech understanding of stochastic SDTSs with learning capabilities is described in
[152]. Subsequential transducer learning is described in [150].

In [86], statistical translation models are used to translate a source sentence S into a target MRL.

Let E be the sequence of English-language words in an ATIS utterance, and let F be the semantic
content of the utterance as represented in an appropriate formal language. We are interested in the
joint distribution Pr(F,E) = Pr(F)Pr(E|F). In particular, for a given E, we want to find its most probable
translation:

arg-max[Pr(F|E)] = arg-max[Pr(F,E)] = arg-max[Pr(F)Pr(E|F)]

Thus, one needs a language model Pr(F) for the semantic content of ATIS requests, and a translation
model Pr(E|F).

The central idea of the translation model is that the English sentence can be clumped into phrases and
each clump is generated by a word of the MRL.

The alignment between clumps in E and units in F is hidden, as well as the boundaries of each clump.
Let E = e_1, ..., e_i, ..., e_LE be the English sentence made of LE words and
F = f_1, ..., f_j, ..., f_LF the corresponding sequence of LF clump-describing symbols. Notice that there
may be more clumps aligned to the same symbol.

An alignment A is an LC-tuple describing, for each element f_j, the words e_i (in contiguous order)
corresponding to it. Let a_i be the element of A containing e_i and f(a_i) be the corresponding element
in F. One can write:

P(E | F) = \sum_A P(E, A | F) = P(LE | LF) \prod_{i=1}^{LE} P[e_i | f(a_i)] P(a_i | F)

These probabilities are provided by a fertility model.

In [86], statistical translation models are used to translate a source sentence S into a target, artificial
language T by maximizing the following probability:

Pr(T | S) = \frac{Pr(S | T) Pr(T)}{Pr(S)}

The central task in training is to determine correlations between groups of words in one language and
groups of words in the other. The source-channel model fails to capture such correlations, so a direct model
has been built to compute the posterior probability P(T|S) directly.

Binary-valued features are used to characterize the cooccurrence of a sequence s and a sequence t in the
two languages. Features query the presence or absence of n-grams, long-distance bigrams, and sets of words
both in the source and target language.

The posterior probability is computed as follows:

P(T | S) = \frac{P_0(T | S) \prod_i \alpha_i^{\Phi_i(S,T)}}{Z(S)}

where α_i = e^{λ_i} is a multiplicative vote assigned to feature Φ_i(S, T).

This is the optimal solution for any probability distribution that satisfies the constraints and is closest to
the prior one.

The goodness of a model can be measured considering pairs of {history, future} events {h, f}. A
measure of goodness of a model P is discrimination, measured as:

D(P) := \sum_{i=1}^{T} \log \frac{P(f_i | h_i)}{P(\tilde{f}_P(h_i) | h_i)}

where \tilde{f}_P(h_i) is the future predicted by the model given the history, i.e. the best model guess
based on:

\tilde{f}_P(h) = \arg\max_f P(f | h)

Another measure of goodness is the likelihood of the training data involving the empirical distribution:

\sum_{h,f} \tilde{P}(h, f) \log P(f | h)

Given a feature set, discrimination is computed with the set of α_i = e^{λ_i} that minimizes D. Starting
from a small set of features, new features are added one at a time and ordered according to how much
they reduce D.

Interesting solutions for semantic interpretation using a machine translation approach can be found in
[69].

Sudoh and Tsukada [117] propose a statistical NLU model that can be trained using a loose
correspondence between pairs of a word sequence and a set of concepts associated at the sentence level.
Concepts are represented as attribute/value pairs.

In addition, in spoken dialogue systems it is necessary to reject erroneous concept hypotheses as well as
to find the most likely ones. For rejection, a confidence measure for concepts is necessary. A method is
proposed using a statistical translation model from words to concepts. The translation model can be
trained using a corpus with simplified concept annotation, where each sentence is aligned to a set of
concepts but each concept is not explicitly aligned to the corresponding words in the sentence.

Here, alignments between words and concepts can be automatically obtained based on cooccurrence.
The model is an N-gram-based joint probability model that can easily be integrated with existing N-
gram-based ASR engines. The confidence of SLU hypotheses can be obtained as posterior probabilities
using concept-level contexts in the form of an N-gram.

9. MODULAR SEMANTIC INTERPRETATION

Semantic interpretation involves operations of different types performing, among other things, a sort of
syntactic analysis, the generation of MRL descriptions, and inference. Approaches purely based on
grammars show limitations in assuring adequate coverage and in dealing with ungrammatical
sentences, hesitations and corrections, and the imprecision of the ASR component.

In order to increase interpretation accuracy, it appears useful to perform different operations with
suitable modules, each using specific methods, models and strategies. Figure 13 shows the scheme of a
modular architecture.
Figure 13 – Modular architecture: speech is decoded by an ASR control module using n-grams, CLMs,
grammars and classifiers learned from a corpus; the resulting words and lattices feed shallow parsing
and meaning composition modules supported by semantic grammars, production rules and logics.

Following ideas about local parsing [1], interesting results were found on the generation of semantic
constituents using finite-state models and different types of specific classifiers. Depending on the
domain complexity, constituent hypotheses can be composed into semantic structures with semantic
grammars, logical inference, and situation-action rules.

Semantic constituent hypotheses are generated with shallow semantic parsing using classifiers trained
with recent machine learning algorithms. The contribution of different interpretation features is scored
with exponential models.

Shallow semantic parsing with the goal of creating a domain-independent meaning representation based
on a predicate/argument structure was first explored in detail by Gildea and Jurafsky [33] and by Pradhan
[93].

Most of the approaches to shallow parsing use features and perform classification; they can be divided
into two broad classes: Constituent-by-Constituent (C-by-C) or Word-by-Word (W-by-W) classifiers
[37].

In C-by-C classification [93], the syntactic tree representation of a sentence is linearized into a sequence
of non-terminal syntactic constituents. Then, each constituent is classified into one of several
arguments or semantic roles using features derived from its respective context. In the W-by-W method,
features are obtained with a bottom-up process for each word after chunking a sentence into phrases.
In [119], a method for unsupervised semantic role labelling is proposed for large corpora. The approach
starts with "bootstrapping" by making role assignments that are unambiguous according to a verb
lexicon. Then, iteratively, a probability model is created based on the currently annotated semantic
roles. This model is used to assign roles having sufficient evidence, which are added to the annotated set.
The procedure is repeated and probability thresholds are adapted until all predicate arguments have been
assigned roles. Class back-off probabilities are used when detailed probabilities cannot be reliably
estimated. Interpretation can benefit from useful collections of linguistic information.

A lexicon can be used for semantic role labelling which lists the possible roles for each syntactic
argument of each predicate. A predicate lexicon is available for FrameNet [3], and a verb lexicon is
available for PropBank [68].

VerbNet [54] specifies, for each verb class, the corresponding syntactic frames along with the semantic
role assigned to each slot of a frame.

Various feature-based methods have been proposed for identifying and classifying predicates and
arguments and for extracting relations using kernel methods and maximum entropy models [49,118].
In [140], a combination is proposed of partial parsing, also called chunking, with the mapping of the
verb arguments onto subcategorization frames that can be extracted automatically, for example, from
WordNet [75].

MindNet [103] produces a hierarchical structure of semantic relations (semrels) from a sentence using
words in a machine-readable dictionary. These structures are inverted and linked with every word
appearing in them, thus allowing matching and the computation of similarities by spreading
activation.

Results in [94] with SVM classifiers have shown that there is a significant drop in performance when
training and testing on different corpora.

Committee-based active learning uses multiple classifiers to select samples [113]. The concurrent
use of SCT, boosting [109] and SVM classifiers is proposed in [100] to increase classification
robustness.

A cascade of classifiers for a two-step interpretation strategy is proposed in [66].

In [109], the possibility is considered of using human-crafted knowledge to compensate for the lack of
data in building robust classifiers. The AdaBoost algorithm proposed for this task combines many
simple and moderately accurate categorization rules that are trained sequentially into a single, highly
accurate model. AdaBoost is entirely data-driven and requires an adequate amount of data for training.
A new modification of boosting is proposed that combines and balances human expertise with available
training data. The basic idea of the approach is to modify the loss function used by boosting to balance
two terms, one measuring fit to the training data, and the other measuring fit to a human-built model.
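
One way to picture this balanced objective is as a boosting-style logistic loss plus a term penalizing divergence from the human-built prior model; the sketch below is an interpretation of that idea, not the exact formulation of [109], and all quantities are hypothetical:

```python
import numpy as np

# Illustrative loss balancing fit to labeled data against fit to a
# human-built prior model. f: real-valued model scores; y in {-1,+1};
# prior: the prior model's P(y=+1|x); eta sets the balance.

def balanced_loss(f, y, prior, eta):
    sigma = 1.0 / (1.0 + np.exp(-2.0 * f))           # logistic link on scores
    data_term = np.log(1.0 + np.exp(-y * f)).sum()   # fit to training labels
    prior_term = -(prior * np.log(sigma)
                   + (1.0 - prior) * np.log(1.0 - sigma)).sum()
    return data_term + eta * prior_term              # trade-off between the two
```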

For the interpretation of written text, assigning arguments to predicates has been considered as a
tagging problem for which various supervised machine learning techniques have been proposed
[9,33,37,93]. Some of the features are the predicate, the syntactic category of a phrase and its position
with respect to the predicate, the head-word, named entities, and other features of the parse tree.

In [93], the parsing problem is formulated as a multi-class classification problem and uses an SVM
classifier whose scores are converted to probabilities using a sigmoid function. For each sentence being
parsed, an argument lattice is generated. A Viterbi search is performed through the lattice combining the
probabilities computed from the SVM output with the LM probabilities, to find the maximum likelihood
path.

The issue of combining model-driven grammar-based and data-driven approaches has been considered
in [131].

A naive Bayes classifier is considered for each concept c. The input to the classifier is a vector with each
element corresponding to a key word. An element w i is set to 1 only if the word is present in the
sentence to be interpreted. The probability of a semantic hypothesis c is computed as:

P(W | c) = \prod_{i=1}^{I} P[(w_i = 1) | c]

P[(w_i = 1) | c] = \frac{N_{ic} + \beta}{N_c + 2\beta}

where N_{ic} is the number of times word w_i appears in a sentence expressing concept c and β is a
constant.
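
A few lines suffice to render this smoothed estimate; the counts below are toy assumptions:

```python
# Sketch of the smoothed naive Bayes estimates above: word presence
# probabilities use add-beta smoothing, and P(W|c) multiplies them over the
# keywords present in the sentence.

def p_word_given_concept(n_ic, n_c, beta=1.0):
    return (n_ic + beta) / (n_c + 2.0 * beta)

def p_sentence_given_concept(present_words, counts_for_c, n_c, beta=1.0):
    p = 1.0
    for w in present_words:
        p *= p_word_given_concept(counts_for_c.get(w, 0), n_c, beta)
    return p

counts_flight = {"flight": 40, "depart": 25}   # N_ic counts for concept FLIGHT
print(p_sentence_given_concept(["flight", "depart"], counts_flight, n_c=50))
```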

The other classifiers are SVM (one per concept). Word n-grams probabilities are also used.

Combination of classifiers is performed by a voting procedure. Important improvements were observed
by replacing certain words with their semantic categories found by a parser.

At AT&T [4], a mixture language model for a multimodal application is described, with a component
trained with in-domain data and another obtained with data generated by a grammar. Understanding is
the recognition of the sequence of predicate/argument tags that maximizes P(T|W), where T is the tag
sequence and W the sentence. An approximation is made by considering bigrams and trigrams of tags.

A declarative multimodal context-free grammar is used in which each terminal is a triple W:G:M,
consisting of speech (words, W), gesture (gesture symbols, G), and meaning (meaning symbols, M). The
symbol SEM is used to abstract over specific content such as the set of points delimiting an area or the
identifiers of selected objects.

The meaning is generated by concatenating the meaning symbols and replacing SEM with the
appropriate specific content.

The multimodal context-free grammar is compiled into an unweighted FSM using standard
approximation techniques. The problem of domain adaptation is also considered.
There have been a number of computational implementations of wide-coverage, domain-independent
syntactic grammars for English in various formalisms. An example is the Lexicalized Tree-Adjoining
Grammar (LTAG) formalism. An LTAG consists of a set of elementary trees (supertags) (Bangalore
and Joshi, 1999), each associated with a lexical item. The set of sentences generated by an LTAG can be
obtained by combining supertags using substitution and adjunction operations. In [97], it has been
shown that for a restricted version of LTAG, the combinations of a set of supertags can be represented
as an FSM. This FSM compactly encodes the set of sentences generated by an LTAG grammar and
represents another possible interpretation component.

At IBM [107], a system is proposed which generates an N-best list of word hypotheses with a dialogue
state dependent trigram LM and rescores them with two semantic models. An Embedded context-free
semantic Grammar (EG) is defined for each concept and performs concept spotting by searching for
phrase patterns corresponding to concepts. Trigram probabilities are used for scoring hypotheses with
the EG model. Concept tags are placed at the beginning and end of the corresponding phrases in a
sequence of word hypotheses. The resulting score of a hypothesis is P(W,C). As a result, semantic
hypotheses are generated by filling a number of slots in a frame representation. Decision among these
hypotheses is made based on maximum word coverage.

A second LM, called Maximum Entropy (ME) LM (MELM), computes probabilities of a word, given
the history, using an ME model as follows.

The LM computes:

P(w_i | h_i) = \frac{e^{\sum_{j=1}^{N_j} \lambda_j f_j(w_i, h_i)}}{\sum_{w'} e^{\sum_{j=1}^{N_j} \lambda_j f_j(w', h_i)}}

where w_i is the current word and f_j(w_i, h_i) is a feature function.

With this model, it is possible to compute P(W,C). The first step for building the ME model is to
represent a parse tree obtained with EG by a sequence of words, tags and labels. The grammar rules
perform compositions of concepts too. Each element of the linear sequence of symbols is a token.

Features of the ME model are questions about n-grams of tokens, the current active parent label, the number of
words to the left since starting the current concept, the previous word token, the previous completed constituent,
and the number of words to the left since its completion.

Each sentence resulting from ASR hypotheses is scored with EG and MELM to assign semantic
probability to each word. The corresponding semantic features are extracted from each word.

A decision tree (DT) is built using word features with the purpose of separating the correct and incorrect
words. DT uses the raw score of the respective features.
DT learns feature combinations for predicting acceptance or rejection. These features are used for
computing confidence measures.

Posterior probability (post) as well as EG and MELM scores are used for deciding acceptance or
rejection.

The use of classifiers in spoken opinion analysis is described in [5].

10. DIALOG ACT AND SENTENCE CLASSIFICATION

A speech act is a dialogue fact expressing an action. Speech acts and other dialog facts to be used in
reasoning activities have to be hypothesized from discourse analysis. Different classifiers for speech
acts, goals and roles are proposed in [30]. Dialogue acts (DA) are meaningful discourse units, such as
statements and questions. Dialogue acts and other dialogue events, such as subjectivity expressions, are
related to discourse segments which may contain many sentences. For this reason, in order to build
statistical models for DA hypothesization it is useful to introduce features of various types, such as
lexical, segment and numerical features.

An example of a dialogue act is the following representation of the question "Does Air Canada fly from
Toronto to Dallas?", which would be expressed in predicate calculus by the expression:

TEST
(
CONNECT
(
(SUBJ AC)
(PATH
(
(ORIG TORONTO)
(DEST DALLAS)
)
)
)
).

TEST represents a possible speech act, CONNECT is a predicate corresponding to a verb. SUBJ and
PATH are cases. PATH is a function of the semantic representation of space relations that returns the
value for a space role. ORIG and DEST are the arguments of the function.

Various techniques have been proposed for DA modeling and detection. Among them, it is worth
mentioning semantic classification trees [71], decision trees [115], hidden Markov models (HMMs)
[115], fuzzy fragment-class Markov models [137], neural networks [105,115], and maximum entropy
models [115].
In [105], dialog acts are hypothesized by a search process based on the Viterbi algorithm. There is an
HMM source for every dialog act DA which generates sequences of words W. The emission probability
is given by:

Pr(W | DA) = \frac{Pr(DA | W) Pr(W)}{Pr(DA)}

and the probability Pr(DA|W) is obtained by a neural network fed with words and prosodic features and
trained using the Kullback-Leibler divergence as error measure.
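
This likelihood flip can be sketched in a few lines: dividing the discriminative posterior by the DA prior (and dropping the DA-independent Pr(W)) yields an emission score usable inside a Viterbi search over dialog acts. The posterior and prior values below are toy assumptions:

```python
import math

# Sketch of recovering HMM emission scores from a discriminative model's
# posteriors Pr(DA|W), up to the constant + log Pr(W).

def emission_logscore(p_da_given_w, p_da):
    return math.log(p_da_given_w) - math.log(p_da)

posteriors = {"STATEMENT": 0.7, "QUESTION": 0.3}   # e.g. neural network output
priors = {"STATEMENT": 0.6, "QUESTION": 0.4}
scores = {da: emission_logscore(posteriors[da], priors[da]) for da in priors}
best_da = max(scores, key=scores.get)
```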

In [120], Pr(DA|W) is obtained from a finite-state model automatically trained using SCTs.

Words and dialogue facts can be related to query communication goals with belief networks [74].
Graphical models are proposed in [47]. The focus is on dynamic Bayesian networks. For joint
segmentation and classification of DAs, a technique based on a Hidden-Event Language Model (HELM)
is described in [142].

A more accurate event detection is obtained if sentence boundaries are identified in spoken messages
containing more than one sentence. Approaches to this task have used Hidden Markov Models (HMM)
[110] and Conditional Random Fields (CRF) [67].

Call routing is an important and practical example of spoken message categorization. In applications of
this type, the dialog act expressed by one or more sentences is classified to generate a semantic primitive
action belonging to a well defined set.

A solution to spoken message categorization is proposed in [35]. Knowledge is represented by a
network used for mapping words or phrases into actions. The network computes a score for every action
hypothesis when fed with words or phrases. Phrases are obtained with grammar fragments. A single-
layer association network is considered whose parameters are estimated with a training corpus. An
overview of early versions of the How May I Help You application can be found in [36]. The application
has evolved with the introduction of new classification and learning methods.

More recent solutions for document type (and sentence type) hypothesization were proposed using
Latent Semantic Analysis (LSA) [7,15].

In [61], discriminative training is proposed for natural language call routers. In [34], a method is
proposed for estimating the LM probabilities with a criterion that optimizes the end-to-end performance of a
natural language call routing system. In [92], the problem of categorical classification of actions from
speech input is investigated. A dialogue model is introduced with state-dependent LMs.

11. PROBABILISTIC LOGIC AND INFERENCE FOR SLU

In practical applications, SLU is part of a dialogue system whose objective is the execution of actions
to satisfy a user goal. Actions can be executed only if some pre-conditions are asserted true and their
results are represented by post-conditions. Preconditions for actions can be formulated in various
ways. A sound representation of them is in formal logic. Logic knowledge is usually structured.
Structures are described by an MRL inspired by computer epistemology. Preconditions for actions
depend on instances of semantic structures composed by previous dialogue actions.
The system knowledge is made of general knowledge, e.g. knowledge about dates and time, and
specific domain knowledge, e.g. the details of a telephone service. Let us call the resulting knowledge
in-domain knowledge.

As a dialogue progresses, part of the domain knowledge is instantiated. The purpose of the dialogue is
to interpret the user beliefs and goals and represent them with the MRL. Eventually, system actions
like accessing a data base, are performed to satisfy a user request. If MRL contains frames, then user
sentences should cause the instantiation of some frames, the assignment of values to some frame roles
and functions to describe them. Instantiation is based on what the user says, but also on what can be
inferred about the implicit meaning of each sentence.

When the semantic knowledge is completely expressed in a sentence, as in the ATIS corpus, it is
possible to detect semantic structures with semantic grammars and build frame instantiations from
parsing results.

It is useful, in general, to represent compositional semantic knowledge in logic form in order to formulate
inferences that are useful for a system to perform. Automatic reasoning is performed with these
schemes using specific strategies.

Control strategies for interpretation determine how semantic structures are built, how expectations are
defined and how knowledge structures are matched with input data in the presence of constraints and
imprecision.

There are two basic types of strategy. One is based on path extraction from a semantic or a frame
network. The other adopts a constructionist approach that can use one or more of the following
methods: inference, parsing, abduction, agenda-based formation and scoring of interpretation
hypotheses called theories.

In the constructionist approach, the meaning of a complex phrase is considered to be a function of the
meanings of its constituent parts and the way in which these parts are syntactically combined.
Reasoning is performed by programs that activate memory structures by placing activation markers on
them. Nodes of the structure are activated when the corresponding concepts are instantiated. Active
nodes may spread activation markers to hypothesize or predict the activation of concepts which have
not yet been instantiated. When two markers collide in the same node, a path is identified indicating a
possible inference. Frame-activated inference is discussed in [153].

As several markers can be propagated in parallel, a high degree of parallelism can be achieved with
these models. Massive search of syntactic patterns and coarse semantic patterns following a
hierarchical network is described in [154].

A control strategy can be called constructive if it gradually builds data structures using a basic queue
called agenda where pointers to partial interpretations, called theories, are stored in an order
dependent on the scores assigned to the theories. In [153], inference candidates are stored into an
agenda for further evaluation.
Extensions of a theory may be computed in different ways. A graph grammar may define a semantic
network with attributes attached to nodes. Attributes are constraints. A theory is the instantiation of a
partial path in the network. If constraints of a possible path extension are satisfied, they trigger rules
that build and grow paths (chunks of semantic interpretations) to be placed into an agenda. This
problem is NP-complete, but the process can be speeded up with heuristics. More on graphical
representation of concepts can be found in [148].

Other methods are theorem proving, rule chaining, abduction.

Abduction is the process of providing the best explanation of why logical expressions would be true.
Given a schema p(y) described by a schema specification language, and the rule A(y) ⊃ Q(y), if
Q(m) is asserted to be true, then abduction would derive A(m). New assertions increase the mutual
belief of the speaker who made them and the hearer who performs abduction with them. This mode of
inference does not have a completely valid theoretical ground. Nevertheless, it is made practically
useful with heuristics represented by weights for consistency, simplicity and consilience [155].

Early approaches to SLU used semantic representations in terms of partitioned semantic networks
[127]. Marker propagation was used for making predictions about concepts likely to appear in the
natural language messages. Concept hypotheses were generated by templates matching word and
partial parses (obtained with a best first parser) with semantic structures.

In the Hearsay II SLU architecture [29], a heterarchical architecture was used for applying rules for
matching and inference. An agenda based control strategy selects a rule whose precondition matches
the content of a blackboard. If matching is successful, then actions are performed which modify the
content of the blackboard. Production rules are used in [156,157].

The weakness of these approaches was that they did not contain an effective method for evaluating the
confidence of the generated hypotheses.

It is also useful to consider how probabilities can be computed for evaluating how likely a conclusion is,
given all the available information about the user sentences. Semantic grammars do not perform
inference, even if simple inferences can be performed by rules which, in some cases, can be encoded
into FSMs. Recently, in [21], a solution has been proposed for evaluating probabilities of interpretations
generated by production rules. A dialogue manager (DM) of a vocal service has a state model. A set of
states is active at turn k of a dialogue. The system interprets a dialogue turn message in two phases.

In the first phase, strings of words are translated into strings of semantic constituents C. In the second
phase, a set of about 1600 precondition/action rules takes as input a set of constituents and generates a
structured interpretation Γ_k. The rules are ordered and the first one whose precondition matches is
applied. The result is further processed by the DM.

The interpretation strategy just outlined is applied to a lattice of semantic constituents to obtain a set of I
structured interpretations Γ_{i,k}. A word-to-constituent transducer translates a word lattice into a
constituent lattice. The precondition/action rules are also encoded as a transducer that transforms concept
tag hypotheses into a rule identification number. The action of the corresponding rule is then executed.
Let us assume that partial structures Γ_{i,k} of constituents C_{i,k} hypothesized from a sequence of words W_{i,k}
cause the transition between the successive dialogue states S_{l,k-1} → S_{j,k}. The probability of
reaching state S_{j,k} with partial structures Γ_{i,k} can be approximated as follows:

P(S_{j,k} | Γ_{i,k}, Y_k, S_{l,k-1}) ≈ P(S_{j,k} | Γ_{i,k}, S_{l,k-1}) \cdot \max_{W_{i,k}, C_{i,k}} \{ P(Γ_{i,k} | C_{i,k}) P(C_{i,k} | W_{i,k}) P(W_{i,k} | Y_k) \}

Yk is the sequence of acoustic feature description of the spoken message uttered by the user in turn k.

Notice that, in this case, the pure constituents are considered without any detected relation between
constituents and structures.

The N-best states are then processed by DM to determine the next dialogue action.

Instances of constituents can be structured into probabilistic frames. In probabilistic frame-based
systems [57], it is possible to have a probability model for a slot value which depends on a slot chain. It
is also possible to inherit probability models from classes to subclasses, to use probability models in
multiple instances and to have probability distributions representing structural uncertainty about a set of
entities.
It is shown that it is possible to construct a Bayesian network (BN) for a specific instance-based query and
then perform standard BN inference.

A general method based on Petri nets for probabilistic inference on frames is proposed in [82].

Methods for probabilistic logic learning are reviewed in [23].

If different logical worlds have to be considered, then possible world probabilities have to be estimated.
The computation of probabilities of possible worlds is discussed in [84],[88] (p. 459). A general method
for computing probabilities of possible worlds based on Markov logic networks (MLN) is proposed in
[104].

12. SEMANTIC CONFIDENCE

Current state-of-the-art speech recognition and understanding systems make errors that have to be
identified in order to apply appropriate strategies for performing communicative and system actions,
such as error correction and repair dialogs.

In ASR, the decoding strategy finds a sequence of words which has the maximum posterior probability
of being conveyed by the speech signal. For SLU, it is more important to extract the meaning of a
sentence. The meaning may have relations with the context in which the sentence is uttered. Different
components for the meaning may share some words and are not necessarily in competition. It is useful
to estimate the probability that a concept hypothesis is correct given any evidence that can support it,
rather than just considering the acoustic features of a sentence.

The posterior probability P(Γ | Y), where Y is a time sequence of acoustic features, is not the best reliability
indicator for a hypothesis [116]. In fact, acoustic, lexical, language and semantic models introduce
various degrees of imprecision. Furthermore, suitable confidence indices should also take into account
information that is not coded in Y, such as the coherence of the available hypotheses with the entire
dialogue history, including system prompts and repairs.

It is important to design algorithms for computing the probability P(Γ | Φ_conf) that an interpretation Γ is
correct given Φ_conf, which represents a set of confidence indicators or functions of them.
In [12], important issues related to confidence metrics for ASR and SLU are discussed. They refer to the
identification of errors and confidence features, feature combination and use, and evaluation.
Confidence measures for ASR are reviewed in [41]. Confidence measures for SLU are proposed in [85].
Confidence measures for ASR and SLU are reviewed in [32,48].
The majority of the approaches share two basic steps:
• generate as many features as possible based on the speech recognition and/or natural language
understanding process;
• estimate correctness probabilities with these features.
Typically, confidence measures depend on the particular application and its domain.

Using the posterior probabilities of the words supporting an interpretation, obtained with acoustic and
language models [64], the probability that a conceptual structure is correct can be evaluated [58].

Confidence models are proposed for confidence scoring. A confidence model provides scores for word
and concept hypotheses based on training data. Hazen et al. [21] propose two levels of features to train
confidence models for words: word-level features that focus only on the reliability of acoustic
samples, and utterance-level features that concern the appropriateness of the whole utterance in which
the word is found. The assumption is made that if the whole utterance is unreliable, then a word contained
in that utterance is likely to be incorrect.

Lin and Wang [65] propose a concept-based probabilistic verification model, which also exploits
concept N-grams.

In order to achieve more accurate scoring depending on the context, in [93] it is proposed to create
confidence models for semantic frames using previous system prompts in addition to the features
obtained from the speech recognition results.

Among the methods for fusing confidence scores it is worth mentioning Fisher linear discriminant
analysis [50], decision trees, neural networks and SVM [141].

In [100], an interpretation strategy implemented by a decision tree is proposed. At a node of the tree, a
decision unit DU_j is applied. The unit computes (or simply uses already computed)
confidence measures or other types of features about the content of the N-best list NB(WG) obtained with a word
graph WG, or about the context ctx. A function of these features, F_j[NB(WG), ctx, Γ], used by DU_j,
returns a set of confidence and context descriptors for one or more conceptual constituent hypotheses Γ.
DU_j then computes the probability

P{Γ_c | F_j[NB(WG), ctx, Γ]}

where Γ_c indicates that the hypothesis Γ is correct.

Following ideas proposed for committee-based active learning [113], some semantic confidence
indicators are based on the agreement of semantic interpretations obtained by different methods using
FSMs, and classifiers of the SCT, SVM and Boostexter types, following the scheme of Figure 14.

Figure 14 – Confidence indicators based on the agreement of semantic interpretations obtained by
different methods: a fusion strategy combines the outputs of FSM, SCT, SVM and AdaBoost classifiers.

These classifiers are completely independent and are applied to the recognition results obtained with a
trigram LM; they provide results which may partially or totally confirm the results obtained with concept-
dependent LMs. However, classifiers can also provide contradictory results and perform, in some cases,
error corrections, because they are trained with ASR results. In general, classifiers are trained from
labelled examples and make decisions based on features which are automatically selected. Features can
describe properties of words or abstractions of word patterns, while LMs deal essentially with sentence
patterns and word sequences. Results obtained with classifiers and with the application of LMs can be
combined to obtain improved interpretations or semantic confidence indicators.
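
A very simple agreement-based indicator can be sketched as a vote count over the classifier outputs; the tags below are invented for illustration:

```python
from collections import Counter

# Sketch of an agreement-based confidence indicator in the spirit of
# Figure 14: independent methods (FSM-based LMs, SCT, SVM, Boostexter) each
# propose a concept tag, and the fraction of agreeing votes serves as a
# semantic confidence score.

def agreement_confidence(hypotheses):
    tag, votes = Counter(hypotheses).most_common(1)[0]
    return tag, votes / len(hypotheses)

tag, conf = agreement_confidence(["DEST", "DEST", "DEST", "ORIGIN"])
# ("DEST", 0.75): accept, confirm or reject depending on a threshold
```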

In [108], both word- and concept-level confidence annotations are considered. Two methods are
proposed that use two sets of statistical features to model the presence of semantic information in a
sentence. The first relies on a semantic tree where node and extension scores are used. Scores are based
on the assumption that sentences that are grammatically correct and likely to be free of recognition
errors tend to be easier to parse and should receive high confidence. The second technique is based on
jointly modelling, with maximum entropy, the words of a sentence and the semantic parse tree. Different maximum
entropy techniques are used to combine semantic and lexical information features depending on the type
of parsing performed. Lattice-based posterior probabilities are combined with semantic features and
dialog state information in a probabilistic framework for each word or concept.
Word lattices can be further processed and formatted into a Confusion-Network (CN) structure. In [70],
an algorithm for the generation of confusion networks (CN) has been proposed. An alternative CN
generation algorithm has been proposed in [40].

Speech recognition systems encounter more difficulties when trying to recognize short words as
compared to longer words. In a word lattice, the ASR system tends to generate a large number of
hypotheses of the same word with different lengths, start frames and acoustic scores. Frequently,
word hypotheses having significantly different time lengths are grouped in the same class, with short
words becoming possible alternatives to much longer words. Following these observations, some
modifications suitable for confidence evaluation in SLU were proposed in [77].

Error correction is proposed in [98].

In [56], probabilities P_Rel(c_i, c_j) of the relations between instantiations of concepts in the same spoken
sentence are defined and related to the mutual information of constituent hypotheses in a sentence.
A confidence measure for a hypothesized concept constituent c_i is defined in terms of the mutual
information between c_i and the other n concept hypotheses c_j in the sentence, where Rel indicates a
relation between constituents c_i and c_j.


The introduction of a "null relation" for each generic concept makes it possible to extend the definition of the
semantic relation confidence measure to the special case of an isolated concept in an utterance.
Some concepts are more frequently observed as single hypotheses than other concepts, which explains
the choice of estimating a semantic relation confidence measure for such concept hypotheses.

This confidence measure is combined with others using logistic regression and decision trees.
These methods are compared using the relative reduction of the cross entropy for a given test set.

The cross entropy, for a test set of N concepts c_i, is defined as:

H = -\frac{1}{N} \sum_{i=1}^{N} [ \delta(c_i) \log p_i + (1 - \delta(c_i)) \log(1 - p_i) ]

where δ(c_i) is an indicator, equal to 1 if the hypothesis c_i is correct and to 0 otherwise, and p_i is the a
posteriori probability that the concept c_i is correct. With no confidence measure, p_i is the same for all
hypotheses and is equal to the precision Prec on the set. Hence, the initial cross entropy is:

H_0 = -[ Prec \log Prec + (1 - Prec) \log(1 - Prec) ]

When a confidence measure CM is introduced, the a posteriori probability becomes the posterior
probability that the hypothesized concept is correct, given its associated confidence measure, yielding a
cross entropy H_CM. The evaluation criterion for a confidence measure is then the relative diminution of
cross entropy induced by the introduction of the confidence measure:

(H_0 - H_CM) / H_0
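
This evaluation can be computed directly, following the formulas above; the correctness indicators and posteriors below are toy values:

```python
import math

# Sketch of the cross-entropy evaluation: compare the cross entropy of the
# correctness indicators under confidence-derived posteriors with the
# baseline that uses the set precision for every hypothesis.

def cross_entropy(delta, probs):
    return -sum(d * math.log(p) + (1 - d) * math.log(1 - p)
                for d, p in zip(delta, probs)) / len(delta)

delta = [1, 1, 0, 1]                 # 1 if the concept hypothesis is correct
p_cm = [0.9, 0.8, 0.2, 0.7]          # posteriors given the confidence measure
prec = sum(delta) / len(delta)       # baseline: p_i = Prec for all i
h0 = cross_entropy(delta, [prec] * len(delta))
h_cm = cross_entropy(delta, p_cm)
relative_reduction = (h0 - h_cm) / h0
```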

Experimental results on a French financial information system show the efficiency of the logistic
regression method for combining measures using different knowledge sources. The improvements brought by
each confidence measure separately are almost additive, which proves that the two confidence measures
complement each other. The semantic relation confidence measure gives better results when it is
integrated in a decision tree; the relative cross entropy reductions are up to 4 times bigger.

Confidence scoring has been applied to detect errors in intention recognition results and has proved
useful for dialogue management [24,58]. If the detection is successful, the system can safely avoid
unnecessary confirmations for reliable slots and give high priority to asking questions about unreliable or
unfilled ones. In [43], it is proposed to incorporate discourse features into the confidence scoring of
intention recognition results. A number of discourse-related features (called discourse features) are
introduced that characterize the contextual adequacy of slot values.

Pragmatic analyses to score concepts uttered by the user are proposed in (Ammicht et al., 2001).
Assumptions are made about the concepts the user should utter after a system response, and are used in
rules to score the incoming concepts and rescore hypothesized concepts.

Prosodic features such as F0, the length of a pause preceding the turn, and the speaking rate are
proposed in [158] to detect misrecognized user turns in spoken dialogue corpora.

In [96], multiple candidate hypotheses from different sources (e.g. deep syntactic parsing and shallow
topic classification) are evaluated and assigned overall confidence scores using features at multiple
levels (e.g. acoustic, semantic and context-based).

A discourse coherence measure [63], based on topic consistency across consecutive utterances, is
obtained with an inter-utterance distance based on the topic consistency between two utterances. The
confidence measures are incorporated into the utterance verification framework by combining them in
the computation of an overall posterior probability.
13. RECENT RESULTS IN ADAPTIVE LEARNING FOR SLU

Knowledge for SLU is imprecise and incomplete. Once an application is deployed, many errors can be
ascribed to SLU knowledge imprecision.

It is useful to adapt systems to fast variations in feature statistics and to learn new events with minimum
supervision. Instead of assuming fixed and given training data, as in the passive learning used in the
approaches reviewed so far, in adaptive learning samples are dynamically determined with automatic
methods.

Methods for adaptive learning are active learning, unsupervised learning, and their combination.
Part of the errors due to SLU knowledge imprecision can be detected by introducing suitable confidence
indicators. The corresponding messages can then be used as samples for updating SLU knowledge.
Such an activity is known as active learning.

Approaches to active learning rely on two basic method types: certainty-based and committee-based
methods. In certainty-based methods, an initial system is developed using a small set of annotated
examples [17]. Such a system is used for interpreting unannotated examples. Confidence indicators are
obtained for these examples, and the examples with the lowest certainties are proposed to human
labelers for annotation. Committee-based methods consider a set of classifiers trained with a small set of
annotated examples [20]. A new set of unannotated instances is presented to the classifiers. The samples
for which different classifiers provide the most different interpretations are selected for human
inspection and annotation.
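
The certainty-based selection step reduces to sorting by confidence; the `confidence` callable below is a hypothetical interface, not part of any cited system:

```python
# Sketch of certainty-based sample selection: interpret unannotated
# utterances with the current model and send the least certain ones to
# human labelers for annotation.

def select_for_annotation(utterances, confidence, budget):
    """Return the `budget` utterances with the lowest confidence scores."""
    return sorted(utterances, key=confidence)[:budget]

# Hypothetical usage: to_label = select_for_annotation(pool, model.confidence, 100)
```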

Applications of certainty-based learning to sentence classification are proposed in [38].


With committee-based learning better results were obtained for sentence classification using SVM and
Boostexter classifiers [124].

A committee-based method which is applicable to multiview problems (i.e., problems with several sets
of uncorrelated attributes that can be used for learning) is co-testing [79]. In co-testing, the committee of
classifiers is trained using different views of the data.

In [39], a method is proposed in which a bootstrap model is built from selected samples of relevant text obtained from transcriptions produced by conversational systems and from data retrieved from web sites. The bootstrap model is updated by an iterative process which combines unsupervised and active learning. Unsupervised learning involves decoding followed by model building. This is implemented by co-training, under the assumption that there are multiple views for classification. Multiple models are trained using the views, and unlabelled data are classified with all the models. The training set of each classifier is then augmented with the other classifiers' predictions. A confidence score is computed for active learning and used to select utterances for manual annotation.
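
The following sketch outlines one possible form of this combined loop for two views; train, predict_with_conf and human_annotate are hypothetical helpers, and tau and budget control how much machine-labelled and human-labelled data are added at each iteration.

def cotrain_with_active_learning(labeled, unlabeled, rounds=5, tau=0.9, budget=50):
    # Each view keeps its own training pool, initialized with the same seed data.
    labeled_a, labeled_b = list(labeled), list(labeled)
    for _ in range(rounds):
        model_a = train(labeled_a, view="a")
        model_b = train(labeled_b, view="b")
        uncertain = []
        for x in unlabeled:
            label_a, conf_a = predict_with_conf(model_a, x)
            label_b, conf_b = predict_with_conf(model_b, x)
            if conf_a >= tau:
                labeled_b.append((x, label_a))  # A's confident predictions train B
            elif conf_b >= tau:
                labeled_a.append((x, label_b))  # and vice versa
            else:
                uncertain.append((x, min(conf_a, conf_b)))
        # Active learning step: the least confident utterances are sent
        # to human annotators, within the available budget.
        uncertain.sort(key=lambda item: item[1])
        for x, _ in uncertain[:budget]:
            y = human_annotate(x)
            labeled_a.append((x, y))
            labeled_b.append((x, y))
        unlabeled = [x for x, _ in uncertain[budget:]]
    return model_a, model_b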

In [102], an active learning method is proposed based on selective sampling and on the prediction of the error rate as a function of the number of training examples.

Interpretation model adaptation is proposed in [125].


A multitask learning method is presented in [126] for natural language intent classification. Already labelled data are reused across applications during training, so that collaboration among tasks improves learning results.

LM adaptation to the prediction of concepts is proposed in [59]. Discriminative training of acoustic and language models using the Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) criteria has been used for language model adaptation in spoken dialogue systems. The learning objective in SLU systems is to minimize the concept error rate, which does not reduce to minimizing the word error rate, the objective of previous MMI and MCE applications [132].
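
As a reminder of the criterion involved (a standard formulation, not taken from [132]), the MMI objective for model parameters \lambda over training pairs (X_r, W_r) can be written as

F_{\mathrm{MMI}}(\lambda) = \sum_r \log \frac{p_\lambda(X_r \mid W_r)\, P(W_r)}{\sum_{W} p_\lambda(X_r \mid W)\, P(W)}

In the SLU setting the sequences W are concept rather than word sequences, which is why maximizing this objective does not coincide with minimizing the word error rate.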

14. CONCLUSIONS

The history of SLU shows an evolution from high-precision, non-probabilistic, low-coverage, human-crafted semantic KSs to modular, complex, probabilistic KSs, some of them obtained with automatic learning from manually annotated corpora.

In the future, it will be interesting to consider a more careful evaluation of the cost and performance of manual annotation vs. manual composition of KSs. Modular architectures make possible the cooperation of KSs obtained with both types of approaches, making effective use of human linguistic knowledge, machine learning algorithms, linguistic resources and available data, together with optimal or sub-optimal decision strategies that exploit the system's capability to assess the confidence of the interpretations it produces.

In spite of the imprecision of the modules used in the SLU chain, it is possible to develop useful applications in limited domains. Thanks to effective confidence indicators, SLU results can be evaluated, and specific dialogue actions can be performed when confidence is not high. By switching to a human operator when verification is not satisfactory, it is possible to achieve, in some cases, good automation rates.

Improvements in models and strategies, thanks to incremental learning and the availability of more accurate models, will increase the automation rate of existing applications and make it possible to develop new applications in more complex domains.

15. ACKNOWLEDGEMENTS

This work was supported by the European Union (EU), Project LUNA, IST contract no. 33549.

16. REFERENCES

1. S. Abney (1991) Parsing by chunks. In R. C. Berwick, S. P. Abney, and C. Tenny, eds., Principle-Based Parsing: Computation and Psycholinguistics, pp. 257-278. Kluwer, Dordrecht.
2. J. Allen (1987) Natural Language Understanding. The Benjamin/Cummings Publishing Company, Menlo Park, CA.
3. C. Baker, C. Fillmore, and J. Lowe (1998) The Berkeley FrameNet project. COLING-ACL 1998.
4. S. Bangalore and M. Johnston (2004) Balancing data-driven and rule-based approaches in the context of a multimodal conversational system. HLT-NAACL, pp. 33-40.
5. F. Béchet, G. Damnati, N. Camelin, R. De Mori (2006) Spoken opinion extraction for detecting variations in user satisfaction. IEEE/ACL Workshop on Spoken Language Technology, Aruba.
6. J. R. Bellegarda (2000) Large vocabulary speech recognition with multi-span statistical language models. IEEE Trans. on Speech and Audio Processing, 8(1):76-84.
7. J. R. Bellegarda and K. E. A. Silverman (2001) Data-driven semantic inference for unconstrained desktop command and control. Eurospeech, Aalborg, Denmark, pp. 455-459.
8. E. Black, F. Jelinek, J. D. Lafferty, D. M. Magerman, R. L. Mercer, S. Roukos (1993) Towards History-Based Grammars: Using Richer Models for Probabilistic Parsing. ACL, pp. 31-37.
9. D. Blaheta and E. Charniak (2000) Assigning function tags to parsed text. NAACL, pp. 234-240.
10. R. Bobrow, R. Ingria and D. Stallard (1990) Syntactic and semantic knowledge in the DELPHI unification grammar. Speech and Natural Language Workshop, pp. 230-236, Hidden Valley, PA; Morgan Kaufmann, Palo Alto, CA.
11. R. Bod (2000) Combining Semantic and Syntactic Structure for Language Modeling. ICSLP, Beijing, China.
12. L. Chase (1997) Error-responsive feedback mechanisms for speech recognition. Ph.D. thesis, Carnegie Mellon Univ., Pittsburgh, PA, USA.
13. E. Charniak (2001) Immediate-head parsing for language models. ACL, pp. 116-123.
14. C. Chelba and F. Jelinek (2000) Structured language modeling. Computer Speech and Language, 14(4):283-332.
15. J. Chu-Carroll and B. Carpenter (1999) Vector-based natural language call routing. Computational Linguistics, 25(3):361-388.
16. N. Coccaro and D. Jurafsky (1998) Towards better integration of semantic predictors in statistical language modelling. ICSLP, pp. 2403-2406.
17. D. Cohn, L. Atlas, and R. Ladner (1994) Improving generalization with active learning. Machine Learning, 15:201-221.
18. M. Collins (1999) Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA.
19. A. Corazza, R. De Mori, R. Gretter and G. Satta (1994) Optimal Probabilistic Evaluation Functions for Search Controlled by Stochastic Context-free Grammars. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(10):1018-1027.
20. I. Dagan and S. P. Engelson (1995) Committee-based sampling for training probabilistic classifiers. 12th Int. Conf. on Machine Learning, pp. 150-157.
21. G. Damnati, F. Béchet, R. De Mori (2007) Spoken language understanding strategies on the France Telecom 3000 voice agency corpus. IEEE ICASSP, Honolulu, Hawaii.
22. R. De Mori (1998) Spoken Dialogues with Computers. Academic Press.
23. L. De Raedt and K. Kersting (2003) Probabilistic Logic Learning. ACM SIGKDD Explorations Newsletter, 5(1).
24. K. Dohsaka, N. Yasuda and F. Aikawa (2003) Efficient spoken dialogue control depending on the speech recognition rate and system database. Eurospeech, Geneva, Switzerland, pp. 657-660.
25. J. Dowding, J. Gawron et al. (1993) Gemini: A Natural Language System for Spoken-Language Understanding. Spoken Language Systems Technology Workshop, MIT, Cambridge, Mass., pp. 20-23.
26. D. Dowty (1979) Word Meaning and Montague Grammar. Reidel, Dordrecht, the Netherlands.
27. E. W. Drenth and B. Ruber (1997) Context-dependent probability adaptation in speech understanding. Computer Speech and Language, 11(3):225-252.
28. H. Erdogan, R. Sarikaya, S. F. Chen, Y. Gao and M. Picheny (2005) Using Semantic Analysis to Improve Speech Understanding Performance. Computer Speech and Language, 19(3):321-344.
29. L. D. Erman, F. Hayes-Roth, V. R. Lesser and R. D. Reddy (1980) The Hearsay-II Speech Understanding System: Integrating Knowledge to Resolve Uncertainty. ACM Computing Surveys, 12(2):213-253.
30. J. Eun, M. Jeong, G. Geunbae Lee (2005) A Multiple Classifier-based Concept-Spotting Approach for Robust Spoken Language Understanding. Eurospeech, Lisbon, Portugal, pp. 3441-3444.
31. C. J. Fillmore (1968) The case for case. In E. Bach and R. Harms, eds., Universals in Linguistic Theory, Holt, Rinehart and Winston, New York.
32. S. Furui (2003) Robust Methods in Automatic Speech Recognition and Understanding. Eurospeech, Geneva, Switzerland, pp. 1993-1998.
33. D. Gildea and D. Jurafsky (2002) Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.
34. V. Goel, H. K. J. Kuo, S. Deligne, C. Wu (2005) Language model estimation for optimizing end-to-end performance of a natural language call routing system. IEEE ICASSP, Philadelphia, PA, USA, pp. I-565-568.
35. A. L. Gorin (1995) On automated language acquisition. Journal of the Acoustical Society of America, 97(6):3441-3461.
36. A. L. Gorin, G. Riccardi, and J. H. Wright (1997) How may I help you? Speech Communication, 23(1-2):113-127.
37. K. Hacioglu and W. Ward (2003) Target word detection and semantic role chunking using support vector machines. HLT-NAACL, Edmonton, Alberta, Canada.
38. D. Hakkani-Tur, G. Riccardi and A. Gorin (2002) Active learning for automatic speech recognition. IEEE ICASSP, Orlando, FL, USA.
39. D. Hakkani-Tur, G. Tur, M. Rahim, G. Riccardi (2004) Unsupervised and active learning in automatic speech recognition for call classification. IEEE ICASSP, Montreal, Canada, pp. I-429-432.
40. D. Hakkani-Tur, F. Béchet, G. Riccardi and G. Tur (2006) Beyond ASR 1-Best: Using Word Confusion Networks for Spoken Language Understanding. Computer Speech and Language, 20(4):495-514.
41. T. Hazen, T. Burianek, J. Polifroni, and S. Seneff (2000) Recognition confidence scoring for use in speech understanding systems. Automatic Speech Recognition Workshop, Paris, France, pp. 213-220.
42. Y. He and S. Young (2006) Spoken language understanding using the Hidden Vector State Model. Speech Communication, 48:262-275.
43. R. Higashinaka, N. Miyazaki, M. Nakano and K. Aikawa (2004) Evaluating discourse understanding in spoken dialogue systems. ACM Trans. on Speech and Language Processing, 1(1):1-20.
44. R. Jackendoff (1990) Semantic Structures. The MIT Press, Cambridge, Mass.
45. R. Jackendoff (2002) Foundations of Language. Oxford University Press, Oxford, UK.
46. E. Jackson, D. Appelt, J. Bear, R. Moore and A. Podlozny (1991) A template matcher for robust natural language interpretation. Speech and Natural Language Workshop, pp. 190-194, Morgan Kaufmann, Los Altos, CA, USA.
47. G. Ji and J. Bilmes (2005) Dialog act tagging using graphical models. IEEE ICASSP, pp. I-33-36.
48. H. Jiang (2005) Confidence measures for speech recognition: a survey. Speech Communication, 45(4):455-470.
49. N. Kambhatla (2004) Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations. ACL, pp. 177-180 (poster), Barcelona, Spain.
50. S. O. Kamppari, T. J. Hazen (2000) Word and phone level acoustic confidence scoring. IEEE ICASSP, Istanbul, Turkey, pp. 1799-1802.
51. R. T. Kasper and E. H. Hovy (1990) Performing integrated syntactic and semantic parsing using classification. Speech and Natural Language Workshop, pp. 54-59, Hidden Valley, PA; Morgan Kaufmann, Los Altos, CA, USA.
52. R. J. Kate and R. J. Mooney (2007) Semi-Supervised Learning for Semantic Parsing using Support Vector Machines. NAACL HLT, pp. 81-84.
53. T. Kawahara, K. Tanaka and S. Doshita (1999) Virtual fitting room with spoken dialogue interface. ESCA Workshop on Interactive Dialog in Multi-Modal Systems, Kloster Irsee, Germany, pp. 5-8.
54. K. Kipper, H. T. Dang, and M. Palmer (2000) Class based construction of a verb lexicon. AAAI.
55. D. H. Klatt (1977) Review of the ARPA speech understanding project. Journal of the Acoustical Society of America, 62(6):2405-2420.
56. C. Kobus, G. Damnati, L. Delphin-Poulat, R. De Mori (2006) Exploiting semantic relations for a Spoken Language Understanding application. ICSLP, Pittsburgh, PA, USA, pp. 1029-1032.
57. D. Koller and A. Pfeffer (1998) Probabilistic frame-based systems. AAAI, pp. 580-587, Madison, Wisc., USA.
58. K. Komatani, T. Kawahara (2000) Flexible mixed-initiative dialogue management using concept-level confidence measures of speech recognizer output. COLING, Vol. 1, pp. 467-473.
59. R. Kneser, J. Peters (1997) Semantic Clustering for Adaptive Language Modeling. IEEE ICASSP, Munich, Germany, p. 779.
60. R. Kuhn and R. De Mori (1995) The Application of Semantic Classification Trees to Natural Language Understanding. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17:449-460.
61. H. K. J. Kuo and C. H. Lee (2003) Discriminative training of natural language call routers. IEEE Trans. on Speech and Audio Processing, 11(1):24-35.
62. J. Lambek (1958) The mathematics of sentence structure. American Mathematical Monthly, 65:154-170.
63. I. Lane and T. Kawahara (2005) Utterance Verification Incorporating In-domain Confidence and Discourse Coherence Measures. Eurospeech, Lisbon, Portugal, pp. 421-424.
64. R. Lieb, T. Fabian, G. Ruske and M. Thomae (2004) Estimation of Semantic Confidences on Lattice Hierarchies. ICSLP, Jeju Island, Korea.
65. Y. C. Lin and H. M. Wang (2001) Probabilistic concept verification for language understanding in spoken dialogue systems. Eurospeech, Aalborg, Denmark, pp. 1049-1052.
66. W.-L. Wu, R.-Z. Lu, H. Liu, F. Gao (2006) A Spoken Language Understanding Approach Using Successive Learners. ICSLP, Pittsburgh, PA, USA, pp. 1906-1909.
67. Y. Liu, A. Stolcke, E. Shriberg, and M. Harper (2005) Using conditional random fields for sentence boundary detection in speech. ACL, pp. 451-458.
68. M. Palmer, D. Gildea, and P. Kingsbury (2003) The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics.
69. K. Macherey, F. J. Och and H. Ney (2001) Natural language understanding using statistical machine translation. Eurospeech, Aalborg, Denmark, pp. 2205-2208.
70. L. Mangu, E. Brill, and A. Stolcke (2000) Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks. Computer Speech and Language, 14(4):373-400.
71. M. Mast et al. (1996) Dialog act classification with the help of prosody. ICSLP, Philadelphia, PA, USA.
72. J. McCarthy and P. J. Hayes (1969) Some philosophical problems from the standpoint of artificial intelligence. Machine Intelligence, ed. by B. Meltzer and D. Michie, Edinburgh University Press.
73. M. McTear (2006) Spoken language understanding for conversational dialog systems. IEEE/ACL Workshop on Spoken Language Technology, Aruba.
74. H. M. Meng, W. Lam and C. Wai (1999) To believe is to understand. Eurospeech, Budapest, Hungary.
75. G. A. Miller (1995) WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.
76. S. Miller, R. Bobrow et al. (1994) Statistical Language Processing Using Hidden Understanding Models. Spoken Language Technology Workshop, pp. 48-52, Plainsboro, NJ, USA.
77. B. Minescu, G. Damnati, F. Béchet, R. De Mori (2007) Conditional use of Word Lattices, Confusion Networks and 1-best string hypotheses in a Sequential Interpretation Strategy. Interspeech, Antwerp, Belgium.
78. R. Montague (1974) Formal Philosophy. Yale University Press, New Haven, Conn., USA.
79. A. Muslea (2000) Active Learning with Multiple Views. Ph.D. dissertation, Univ. of Southern California, Los Angeles, CA, USA.
80. A. Nagai, Y. Ishikawa and K. Nakajima (1994) A semantic interpretation based on detecting concepts for spontaneous speech understanding. ICSLP, Yokohama, Japan, pp. 95-98.
81. S. Narayanan (1999) Moving right along: A computational model of metaphoric reasoning about events. AAAI, Menlo Park, CA, USA.
82. S. Narayanan (1999) Reasoning about actions in narrative understanding. IJCAI, Morgan Kaufmann Press.
83. A. Nasr, Y. Estève, F. Béchet, T. Spriet, R. De Mori (1999) A language model combining n-grams and stochastic finite state automata. Eurospeech, Budapest, Hungary, pp. 2175-2178.
84. N. Nilsson (1986) Probabilistic logic. Artificial Intelligence, 28:71-87.
85. C. Pao, P. Schmid, and J. Glass (1998) Confidence scoring for speech understanding systems. ICSLP, Sydney, NSW, Australia.
86. K. A. Papineni, S. Roukos and R. T. Ward (1998) Maximum likelihood and discriminative training of direct translation models. IEEE ICASSP, Seattle, WA, USA.
87. P. F. Patel-Schneider, P. Hayes and I. Horrocks (2003) OWL Web Ontology Language Semantics and Abstract Syntax. W3C Working Draft.
88. J. Pearl (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, USA.
89. F. Pereira (1990) Finite-state approximations of grammars. Speech and Natural Language Workshop, Hidden Valley, PA, pp. 12-19.
90. R. Pieraccini, E. Levin and C. H. Lee (1991) Stochastic Representation of Conceptual Structure in the ATIS Task. Speech and Natural Language Workshop, pp. 121-124, Los Altos, CA.
91. R. Pieraccini, E. Levin, E. Vidal (1993) Learning How To Understand Language. Eurospeech, pp. 1407-1412, Berlin, Germany.
92. A. Potamianos, S. Narayanan and G. Riccardi (2005) Adaptive Categorical Understanding for Spoken Dialogue Systems. IEEE Trans. on Speech and Audio Processing, 13(2):321-329.
93. S. S. Pradhan, W. Ward, K. Hacioglu, J. H. Martin and D. Jurafsky (2004) Shallow Semantic Parsing using Support Vector Machines. HLT-NAACL, Boston, Mass., USA, pp. 233-240.
94. S. S. Pradhan, W. Ward, and J. H. Martin (2007) Towards Robust Semantic Role Labeling. NAACL HLT, Rochester, NY, USA, pp. 556-563.
95. N. Prieto, E. Sanchis and L. Palmero (1994) Continuous speech understanding based on automatic learning of acoustic and semantic models. ICSLP, Yokohama, Japan.
96. M. Purver, F. Ratiu, L. Cavedon (2006) Robust Interpretation in Dialogue by Combining Confidence Scores with Contextual Features. ICSLP, Pittsburgh, PA, USA, pp. 1-4.
97. O. Rambow, S. Bangalore, T. Butt, A. Nasr, R. Sproat (2002) Creating a Finite State Parser with Application Semantics. COLING, Taipei.
98. C. Raymond, F. Béchet, N. Camelin, R. De Mori, G. Damnati (2005) Semantic interpretation with error correction. IEEE ICASSP, Philadelphia, PA, USA.
99. C. Raymond, F. Béchet, R. De Mori and G. Damnati (2006) On the use of finite state transducers for semantic interpretation. Speech Communication, 48(3):288-304.
100. C. Raymond, F. Béchet, N. Camelin, R. De Mori and G. Damnati (2007) Sequential decision strategies for machine interpretation of speech. IEEE Trans. on Speech and Audio Processing, 15(1):162-171.
101. G. Riccardi, R. Pieraccini and E. Bocchieri (1996) Stochastic automata for language modeling. Computer Speech and Language, 10(4):265-293.
102. G. Riccardi and D. Hakkani-Tur (2005) Active Learning: Theory and Applications to Automatic Speech Recognition. IEEE Trans. on Speech and Audio Processing, 13(4):534-545.
103. S. D. Richardson, W. B. Dolan, and L. Vanderwende (1998) MindNet: Acquiring and Structuring Semantic Information from Text. ACL-COLING, Montreal, Canada, pp. 1098-1102.
104. M. Richardson and P. Domingos (2006) Markov Logic Networks. Machine Learning, 62:107-136.
105. K. Ries (1999) HMM and neural network based speech act detection. IEEE ICASSP, Phoenix, AZ, USA.
106. B. Roark (2001) Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249-276.
107. R. Sarikaya, Y. Gao and M. Picheny (2004) A Comparison of Rule-Based and Statistical Methods for Semantic Language Modelling and Confidence Measurement. HLT-NAACL, Boston, Mass., USA, pp. 65-68.
108. R. Sarikaya, Y. Gao, M. Picheny and H. Erdogan (2005) Semantic Confidence Measurement for Spoken Dialog Systems. IEEE Trans. on Speech and Audio Processing, 13(4):534-545.
109. R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta (2005) Boosting With Prior Knowledge for Call Classification. IEEE Trans. on Speech and Audio Processing, 13(2):174-182.
110. E. Shriberg, A. Stolcke, D. Hakkani-Tur, and G. Tur (2000) Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication, 32(1-2):127-154.
111. S. Seneff (1989) TINA: A Probabilistic Syntactic Parser for Speech Understanding Systems. IEEE ICASSP, 2:711-714, Glasgow, UK.
112. S. Seneff (1992) A Relaxation Method for Understanding Spontaneous Speech Utterances. Speech and Natural Language Workshop, Los Altos, CA.
113. H. S. Seung, M. Opper, H. Sompolinsky (1992) Query by Committee. COLT, pp. 287-294.
114. Y. Schabes and A. K. Joshi (1990) Two recent developments in tree adjoining grammars: Semantic and efficient processing. Speech and Natural Language Workshop, pp. 48-53, Los Altos, CA.
115. A. Stolcke et al. (1998) Dialog act modelling for conversational speech. AAAI Spring Symp. on Applying Machine Learning to Discourse Processing, pp. 98-105.
116. K. Sudoh and M. Nakano (2005) Post-dialogue confidence scoring for unsupervised statistical language model training. Speech Communication, 45(4):387-400.
117. K. Sudoh and H. Tsukada (2005) Tightly Integrated Spoken Language Understanding using Word-to-Concept Translation. Eurospeech, Lisbon, Portugal, pp. 429-432.
118. J. Suzuki, H. Isozaki and E. Maeda (2004) Convolution Kernels with Feature Selection for Natural Language Processing Tasks. ACL, pp. 120-127, Barcelona, Spain.
119. R. S. Swier and S. Stevenson (2004) Unsupervised Semantic Role Labelling. EMNLP, Barcelona, Spain, pp. 95-102.
120. K. Tanigaki and Y. Sagisaka (1999) Robust speech understanding based on word graph interface. ESCA Workshop on Interactive Dialog in Multi-Modal Systems, Kloster Irsee, Germany, pp. 45-48.
121. M. Thomae, T. Fabian, R. Lieb and G. Ruske (2005) Hierarchical Language Models for One-Stage Speech Interpretation. Eurospeech, Lisbon, Portugal, pp. 3425-3428.
122. M. Tomita (1986) An Efficient Word Lattice Parsing Algorithm for Continuous Speech Recognition. IEEE ICASSP, Tokyo, Japan, p. 330.
123. K. Toutanova, A. Haghighi, C. D. Manning (2005) Joint learning improves semantic role labeling. ACL, Ann Arbor, Michigan, USA, pp. 589-596.
124. G. Tur, R. E. Schapire and D. Hakkani-Tur (2003) Active learning for spoken language understanding. IEEE ICASSP, Hong Kong, China, pp. I-275-279.
125. G. Tur (2005) Model adaptation for spoken language understanding. IEEE ICASSP, Philadelphia, PA, USA, pp. I-41-44.
126. G. Tur (2006) Multitask learning for spoken language understanding. IEEE ICASSP, Toulouse, France, pp. 585-588.
127. D. Walker (1975) The SRI speech understanding system. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-23(5):397-416.
128. D. L. Waltz (1981) Toward a Detailed Model of Processing for Language Describing the Physical World. IJCAI, Vancouver, BC, Canada, pp. 1-6.
129. K. Wang (2004) A detection based approach to robust speech understanding. IEEE ICASSP, Montreal, Canada, pp. I-413-416.
130. W. Wang, Y. Liu and M. P. Harper (2002) Rescoring effectiveness of language models using different levels of knowledge and their interaction. IEEE ICASSP, Orlando, FL, USA, pp. 785-789.
131. Y. Y. Wang and A. Acero (2003) Combination of CFG and N-Gram Modeling in Semantic Grammar Learning. Eurospeech, Geneva, Switzerland.
132. Y. Y. Wang and A. Acero (2006) Discriminative Models for Spoken Language Understanding. ICSLP, Pittsburgh, PA, USA, pp. 2426-2429.
133. W. Ward and S. Issar (1994) Integrating Semantic Constraints into the Sphinx-II Recognition Search. IEEE ICASSP, pp. 17-19, Adelaide, Australia.
134. W. A. Woods (1970) Transition Network Grammars for Natural Language Analysis. Communications of the ACM, 13(10).
135. W. A. Woods (1975) What's in a link? In D. G. Bobrow and A. Collins, eds., Representation and Understanding, Academic Press, New York.
136. W. A. Woods et al. (1976) Speech Understanding Systems. Bolt, Beranek and Newman Inc., Cambridge, MA, USA, Final Report, Vols. IV-V.
137. C.-H. Wu, G.-L. Yan, and C.-L. Lin (2002) Speech act modelling in a spoken dialog system using a fuzzy fragment-class Markov model. Speech Communication, 38(1-2):183-199.
138. C. Wutiwiwatchai and S. Furui (2006) A multi-stage approach for Thai spoken language understanding. Speech Communication, 48:305-320.
139. S. R. Young, A. G. Hauptmann, W. H. Ward, E. T. Smith and P. Werner (1989) High level knowledge sources in usable speech recognition systems. Communications of the ACM, 32(2):183-194.
140. K. Zechner (1998) Automatic construction of frame representations for spontaneous speech in unrestricted domains. ACL-COLING, Montreal, Canada, pp. 1448-1452.
141. R. Zhang and A. I. Rudnicky (2002) Improve Latent Semantic Analysis based Language Model by Integrating Multiple Level Knowledge. ICSLP, Denver, CO, USA.
142. M. Zimmermann et al. (2005) Toward joint segmentation and classification of dialog acts in multi-party meetings. 2nd MLMI, Edinburgh, UK.
143. S. Lytinen (1992) Semantic-first natural language processing. Proc. National Conference on Artificial Intelligence, pp. 111-116, San Jose, CA.
144. J. I. Tait (1983) Semantic parsing and syntactic constraints. In K. Sparck Jones and Y. A. Wilks, eds., Automatic Natural Language Parsing, Ellis Horwood/Wiley, Chichester.
145. N. Nilsson (1981) Principles of Artificial Intelligence. Tioga Press.
146. R. J. Brachman and H. J. Levesque, eds. (1985) Readings in Knowledge Representation. Morgan Kaufmann, San Mateo, CA.
147. R. J. Brachman (1985) On the epistemological status of semantic networks. In R. J. Brachman and H. J. Levesque, eds., Readings in Knowledge Representation, pp. 191-216, Morgan Kaufmann, San Mateo, CA.
148. J. F. Sowa (1984) Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Menlo Park, CA.
149. R. De Mori, R. Kuhn, and G. Lazzari (1991) A probabilistic approach to person-robot dialogue. IEEE ICASSP, Toronto, ON, Canada, pp. 797-800.
150. A. Castellanos, E. Vidal, M. A. Varo and J. Oncina (1998) Language understanding and subsequential transducer learning. Computer Speech and Language, 12(3):193-228.
151. K. S. Fu (1982) Syntactic Pattern Recognition and Applications. Prentice Hall, Englewood Cliffs, NJ.
152. E. Vidal, R. Pieraccini and E. Levin (1993) Learning associations between grammars: a new approach to natural language understanding. Eurospeech, pp. 383-386, Berlin, Germany.
153. P. Norvig (1987) Inference in Text Understanding. AAAI, Seattle, WA, pp. 561-565.
154. H. Kitano and T. Higuchi (1991) Massively parallel memory based parsing. IJCAI, pp. 918-924, Sydney, Australia.
155. J. R. Hobbs, M. E. Stickel, D. E. Appelt and P. Martin (1993) Interpretation as abduction. Artificial Intelligence, 63:69-142.
156. G. Di Fabbrizio et al. (2002) AT&T help desk. ICSLP, Denver, CO, pp. 2681-2684.
157. R. Engel (2002) SPIN: Language understanding for spoken dialogue systems using a production system approach. ICSLP, Denver, CO, pp. 2717-2720.
158. J. Hirschberg, D. Litman, and M. Swerts (2004) Prosodic and other cues to speech recognition failures. Speech Communication, 43:155-175.

