Cache memory is random access memory (RAM) that a computer microprocessor can access more quickly than regular RAM. As the microprocessor processes data, it looks first in the cache memory; if it finds the data there (from a previous read), it avoids the more time-consuming read from larger memory. A register is a small set of data-holding places that are part of a computer processor. A register may hold a computer instruction, a storage address, or any kind of data used in processing.
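The "look in the cache first" idea described above can be illustrated with a toy sketch (all names and values here are assumptions for illustration, not from the text): a small dict stands in for cache memory and a larger dict for main RAM.

```python
# Illustrative sketch: a dict as cache, a larger dict as main memory (toy data).
cache = {}                                              # fast, small
main_memory = {addr: addr * 2 for addr in range(1024)}  # slow, large

def read(addr):
    if addr in cache:          # cache hit: fast path
        return cache[addr]
    value = main_memory[addr]  # cache miss: slower read from RAM
    cache[addr] = value        # fill the cache for next time
    return value

print(read(7))   # first access: miss, reads from main memory
print(read(7))   # second access: hit, served from cache
```

A real cache also has limited capacity and an eviction policy, which this sketch omits.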
Finite-state automata are efficient computational devices for generating regular languages. An equivalent view is to regard them as recognizing devices: given some automaton A and a word w, applying the automaton to the word yields an answer to the question: is w a member of the language accepted by the automaton? This reversed view of automata motivates their use for a simple yet necessary application of natural language processing: dictionary lookup.

Example: Dictionaries as finite-state automata

Many NLP applications require the use of lexicons or dictionaries, sometimes storing hundreds of thousands of entries. Finite-state automata provide an efficient means for storing dictionaries, accessing them, and modifying their contents. To understand the basic organization of a dictionary as a finite-state machine, assume that an alphabet is fixed (we will use Σ = {a, b, …, z} in the discussion) and consider how a single word, say go, can be
represented. As we have seen above, a naive representation is an automaton with a single path whose arcs are labeled by the letters of the word go. To represent more than one word, we can simply add paths to our lexicon, one path for each additional word. Thus, after adding the words gone and going, we might have:
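The word list above can be stored as a trie-shaped automaton, one path per word, with shared prefixes merged. A minimal Python sketch (an illustration, not an implementation from the text):

```python
# A lexicon stored as a trie-shaped finite-state automaton:
# each state is a dict mapping a letter to the next state.
class LexiconFSA:
    def __init__(self):
        self.start = {}
        self.finals = set()   # final (accepting) states

    def add_word(self, word):
        state = self.start
        for letter in word:
            state = state.setdefault(letter, {})
        self.finals.add(id(state))

    def accepts(self, word):
        state = self.start
        for letter in word:
            if letter not in state:
                return False   # no outgoing arc: w is not in the lexicon
            state = state[letter]
        return id(state) in self.finals

lexicon = LexiconFSA()
for w in ["go", "gone", "going"]:
    lexicon.add_word(w)

print(lexicon.accepts("gone"))  # True
print(lexicon.accepts("gon"))   # False: a prefix of a word, not a word
```

Note that the words go, gone and going share the path for g-o, just as their paths are merged in the automaton of the figure.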
With such a representation, a lexical lookup operation amounts to checking whether a word w is a member of the language generated by the automaton, which can be done by walking the automaton along the path indicated by w. This is an extremely efficient operation: it takes exactly one step for each letter of w. We say that the time required for this operation is linear in the length of w.

The organization of the lexicon as outlined above is extremely simplistic: the lexicon in this view is simply a list of words. For real applications one is usually interested in associating certain information with every word in the lexicon. For simplicity, assume that we do not have to list a full dictionary entry with each word; rather, we only need to store some morpho-phonological information, such as the part of speech of the word, its tense (in the case of verbs) or its number (in the case of nouns). One way to achieve this goal is by extending the alphabet Σ: in addition to ordinary letters, Σ can also include special symbols, such as part-of-speech tags and other morpho-phonological markers. An analysis of a (natural language) word w will in this case amount to recognition by the automaton of an extended word: w followed by some special tags.

Example: Adding morphological information to the lexicon

Suppose we want to add to the lexicon information about part of speech, and we use two tags: -N for nouns and -V for verbs. Additionally, we encode the number of nouns as -sg or -pl, and the tense of verbs as -inf, -prp or -psp (for infinitive, present participle and past participle, respectively). It is very important to note that these additional symbols are multicharacter symbols: the alphabet symbol -sg has nothing in common with the sequence of two alphabet letters (s, g). In other words, the extended alphabet is:
The language generated by the above automaton is no longer a set of words in English. Rather, it is a set of analyzed strings, namely
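The point that tags such as -sg are single symbols of the extended alphabet, not letter sequences, can be made concrete with a small sketch (the tag inventory below follows the example; the function name is hypothetical):

```python
# Split an analyzed string into symbols of the extended alphabet:
# ordinary letters plus multicharacter tags treated as single symbols.
TAGS = {"-N", "-V", "-sg", "-pl", "-inf", "-prp", "-psp"}

def tokenize(analyzed):
    symbols = []
    i = 0
    while i < len(analyzed):
        # Try to match a multicharacter tag at the current position first.
        matched = next((t for t in TAGS if analyzed.startswith(t, i)), None)
        if matched:
            symbols.append(matched)
            i += len(matched)
        else:
            symbols.append(analyzed[i])  # an ordinary letter
            i += 1
    return symbols

print(tokenize("go-V-inf"))  # ['g', 'o', '-V', '-inf']
```

Here -inf is one symbol of the extended alphabet, while g and o remain ordinary one-letter symbols.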
Weighted finite-state transducers (WFSTs) generalize WFSAs by replacing the single transition label by a pair (i,o) of an input label (i) and an output label (o). While a weighted acceptor associates symbol sequences and weights, a WFST associates pairs of symbol sequences and weights, that is, it represents a weighted binary relation between symbol sequences. Consider the pronunciation lexicon in the figure
Suppose we form the union of this transducer with the pronunciation transducers for the remaining words in the grammar G of Figure (a) and then take its Kleene closure by adding an ε-transition from each final state to the initial state. The resulting pronunciation lexicon L would pair any sequence of words from that vocabulary with its corresponding pronunciations. Thus, the following figure illustrates such a scheme, which is used extensively in text-to-speech applications.
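The effect of the union plus Kleene closure can be sketched in miniature (the example words, phone symbols and weights below are assumptions for illustration, not taken from the figure): each word maps to a (phone sequence, weight) pair, and closure means any word sequence is transduced by concatenating phone sequences and accumulating weights along the path.

```python
# Toy stand-in for the pronunciation lexicon L: word -> (phones, weight).
lexicon_L = {
    "data": (["d", "ey", "t", "ax"], 1.0),
    "is":   (["ih", "z"], 0.5),
}

def transduce(words):
    """Map a word sequence to (phone sequence, total path weight)."""
    phones, total = [], 0.0
    for w in words:
        p, weight = lexicon_L[w]
        phones.extend(p)   # concatenate output labels
        total += weight    # weights accumulate along the path
    return phones, total

print(transduce(["data", "is"]))
```

A real WFST toolkit such as OpenFst would represent L as an actual transducer and support composition with other transducers; this sketch only mimics the input/output pairing.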
With the same methodology, we can represent HMMs for phones, mapping them to sequences of words of the given language; this is the main mechanism in speech recognition applications.