Machine Learning
Machines can learn from examples
Learning modifies the agent's decision mechanisms to improve performance
Given training data, machines analyze the data and learn rules that generalize to new examples
The learned rules can be sub-symbolic (a rule may be a mathematical function) or symbolic (rules in a representation similar to that used for hand-coded rules)
In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora
E(h) = Σ_x distance[h(x; θ), f(x)]
Empirical learning = finding h(x), or h(x; θ), that minimizes E(h). Note an implicit assumption:
For any set of attribute values there is a unique target value. This in effect assumes a noise-free mapping from inputs to targets, which is often not true in practice (e.g., in medicine).
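To make the definition concrete, here is a minimal sketch of empirical learning over a toy hypothesis class; the threshold functions, the 0/1 distance, and the data are illustrative assumptions, not part of the slides.

```python
# Empirical learning sketch: search a (tiny) hypothesis class for the
# h(x; theta) that minimizes E(h) = sum_x distance[h(x; theta), f(x)].

def empirical_error(h, data):
    """E(h) under a 0/1 distance between prediction and target."""
    return sum(h(x) != f_x for x, f_x in data)

def make_threshold(theta):
    """Hypothesis h(x; theta): predict True iff x > theta."""
    return lambda x: x > theta

data = [(1, False), (2, False), (3, True), (4, True)]  # (x, f(x)) pairs

# Empirical learning = pick the theta whose hypothesis minimizes E(h).
best_theta = min(range(5), key=lambda t: empirical_error(make_threshold(t), data))
print(best_theta)  # 2 -> h(x) = (x > 2) fits all four examples
```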
Observations:
Huge hypothesis spaces → directly searching over all functions is impossible
Given small data (n pairs), our learning problem may be underconstrained
Ockham's razor: if multiple candidate functions all explain the data equally well, pick the simplest explanation (least complex function)
Constrain our search to classes of Boolean functions, e.g., decision trees
Major issues
Q1: Choosing the best attribute: what quality measure to use?
Q2: Handling training data with missing attribute values
Q3: Handling training data with noise, irrelevant attributes
- Determining when to stop splitting: avoid overfitting
Major issues
Q1: Choosing the best attribute: different quality measures, e.g., information gain, gain ratio (sketch after this list)
Q2: Handling training data with missing attribute values: blank value, most common value, or fractional counts
Q3: Handling training data with noise, irrelevant attributes
- Determining when to stop splitting: use held-out validation data to detect overfitting (see Assessing Performance below)
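As a concrete illustration of the information-gain measure named in Q1, here is a hedged sketch; the toy "outlook" attribute and its labels are made up for the example.

```python
# Information gain for attribute selection: Gain(S, A) = H(S) - remainder.
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c over the class proportions p_c."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """H(S) minus the size-weighted entropy of each subset S_v of S."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: the attribute splits the classes perfectly, so gain = H(S) = 1.0.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0
```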
Assessing Performance
Training data performance is typically optimistic, e.g., the error rate on the training data
Reasons?
- the classifier may not have enough data to fully learn the concept (but on training data we don't know this)
- for noisy data, the classifier may overfit the training data
In practice we want to assess performance out of sample: how well will the classifier do on new, unseen data? This is the true test of what we have learned (just like a classroom).
With large data sets we can partition our data into 2 subsets, train and test:
- build a model on the training data
- assess performance on the test data (a minimal sketch follows)
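A minimal sketch of the two-subset partition described above; the 80/20 ratio and the fixed shuffle seed are arbitrary assumptions, not from the slides.

```python
# Hold out part of the data so performance is measured out of sample.
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle, then hold out a fraction of the data for testing."""
    data = data[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

data = list(range(100))                 # stand-in for (example, label) pairs
train, test = train_test_split(data)
# build the model on `train`, then report the error rate measured on `test`
print(len(train), len(test))            # 80 20
```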
[Figure: predictive error vs. model complexity. Error is high at low complexity (underfitting) and at high complexity (overfitting); the ideal range for model complexity lies in between.]
Validation Data
[Figure: each validation subset Si is held out in turn; the model is trained on the rest and its accuracy Acc(i) is measured on Si.]
Cross-validation accuracy = (1/v) Σ_i Acc(i)
Choose the method with the highest cross-validation accuracy. Common values for v are 5 and 10. Can also do leave-one-out, where v = n.
[Figure: the training data is split into partitions; first the 1st partition is held out for validation, then the 2nd partition, and so on.]
More on Cross-Validation
Notes
cross-validation generates an approximate estimate of how well the learned model will do on unseen data
by averaging over different partitions it is more robust than just a single train/validate partition of the data
v-fold cross-validation is a generalization:
- partition the data into v disjoint validation subsets of size n/v
- train, validate, and average over the v partitions (e.g., v = 10 is commonly used)
- v-fold cross-validation is approximately v times more computationally expensive than just fitting a model to all of the data (a sketch follows)
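The procedure above as a short hedged sketch; `fit` and `accuracy` stand in for whatever learner and metric are being compared, and the striped fold assignment is just one simple way to get disjoint subsets of size about n/v.

```python
# v-fold cross-validation: average Acc(i) over v held-out subsets S_i.

def cross_validation_accuracy(data, fit, accuracy, v=10):
    folds = [data[i::v] for i in range(v)]      # v disjoint subsets, size ~n/v
    total = 0.0
    for i in range(v):
        held_out = folds[i]                      # validation subset S_i
        train = [x for j in range(v) if j != i for x in folds[j]]
        model = fit(train)                       # v fits: ~v times the cost of one
        total += accuracy(model, held_out)       # Acc(i)
    return total / v                             # (1/v) * sum of Acc(i)
```

Compute this for each competing method and pick the one with the highest score; v = len(data) gives leave-one-out.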
Why do we care?
Sometimes, meaning changes a lot. Transcribed speech lacks clear punctuation:
I called, John and Mary are there.
I called John and Mary are there.
(I called John) and (Mary are there.) ??
I called ((John and Mary) are there.)
We can tell, but can a computer? Here, it needs to know about verb forms and collections
Can be important! "Quick! Wrap the bandage on the table around her leg!" Imagine a robotic medical assistant with this one...
Some terms
Corpus: a big body of text, annotated (expert-tagged) or not
Dictionary: a list of known words and all their possible parts of speech
Lexical/Morphological vs. Contextual: is it a property of the word itself (spelling) or of its surroundings (neighboring parts of speech)?
Semantics vs. Syntax: meaning (definitions) vs. structure (phrases, parsing)
Tokenizer: separates text into words or other sized blocks (idioms, phrases, ...); a naive sketch follows this list
Disambiguator: an extra pass to reduce the possible tags to a single one
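For instance, a deliberately naive tokenizer (real tokenizers also handle idioms, clitics, and multiword phrases):

```python
# Naive tokenizer sketch: words vs. single punctuation marks.
import re

def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I called, John and Mary are there."))
# ['I', 'called', ',', 'John', 'and', 'Mary', 'are', 'there', '.']
```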
TBL can have better treatment of context compared to HMMs (as we'll see):
- rules which use the next (or previous) POS; HMMs just use P(Ti | Ti-1) or P(Ti | Ti-2, Ti-1)
- rules which use the previous (next) word; HMMs just use P(Wi | Ti)
Rule Templates
Brill's method learns transformations which fit different templates
Template: change tag X to tag Y when the previous word is W
  Transformation: NN → VB when previous word = "to"
Template: change tag X to tag Y when the previous tag is Z
  Example: "The can rusted." initially tagged The (determiner) can (modal verb) rusted (verb) . (.)
  Transformation: Modal → Noun when previous tag = DET, giving The (determiner) can (noun) rusted (verb) . (.)
Template: change tag X to tag Y when the previous 1st, 2nd, or 3rd word is W
  Transformation: VBP → VB when one of the previous 3 words = "has"
The learning process is guided by a small number of templates (e.g., 26) that are instantiated into specific rules from the corpus. Note how these rules roughly match linguistic intuition; a sketch of applying one instantiated rule follows.
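Here is a minimal sketch of applying one instantiated template (Modal → Noun when the previous tag is DET) to the "The can rusted." example; the tag names and the left-to-right scan are illustrative assumptions about the bookkeeping, not Brill's exact implementation.

```python
# Instantiated template: change tag X to tag Y when the previous tag is Z.

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Rewrite tags left to right over (word, tag) pairs."""
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        if tag == from_tag and tagged[i - 1][1] == prev_tag:
            tagged[i] = (word, to_tag)
    return tagged

# "The can rusted." with "can" initially mis-tagged as a modal verb:
sentence = [("The", "DT"), ("can", "MD"), ("rusted", "VBD"), (".", ".")]
print(apply_rule(sentence, "MD", "NN", "DT"))
# [('The', 'DT'), ('can', 'NN'), ('rusted', 'VBD'), ('.', '.')]
```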
Error-driven method
How does one learn the rules? The TBL method is error-driven.
The rule learned on a given iteration is the one which reduces the error rate on the corpus the most, e.g.:
- Rule 1 fixes 50 errors but introduces 25 more → net decrease is 25
- Rule 2 fixes 45 errors but introduces 15 more → net decrease is 30
- Choose Rule 2 in this case
We set a stopping criterion, or threshold: once a pass stops reducing the error rate by a big enough margin, learning is stopped (see the sketch below).
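Assembled into code, the error-driven loop with its stopping threshold might look like this hedged sketch; `candidate_rules`, `apply_rule` (which returns a retagged copy), and `count_errors` are assumed helpers, not part of the slides.

```python
# Greedy error-driven TBL learning loop with a stopping threshold.

def learn_rules(corpus, gold, candidate_rules, apply_rule, count_errors,
                min_gain=1):
    learned = []
    while True:
        base = count_errors(corpus, gold)
        # Net gain of a rule = errors it fixes minus errors it introduces.
        best = max(candidate_rules,
                   key=lambda r: base - count_errors(apply_rule(corpus, r), gold))
        if base - count_errors(apply_rule(corpus, best), gold) < min_gain:
            break                               # stopping criterion reached
        corpus = apply_rule(corpus, best)       # commit the winning rule
        learned.append(best)                    # output order = learning order
    return learned
```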
Rule ordering
One rule is learned with every pass through the corpus.
The final output is the ordered set of learned rules. Unlike with HMMs, such a representation allows a linguist to look through the rules and make sense of them.
Thus, the rules are learned iteratively and must be applied in the same iterative fashion.
At one stage, it may make sense to change NN to VB after "to". But at a later stage, it may make sense to change VB back to NN in the same context, e.g., if the current word is "school" (see the sketch below).
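The "to school" point replayed in code: two rules applied in learned order, where the later, more specific rule undoes the earlier one. The rule encoding (predicate plus replacement tag) is improvised for this sketch.

```python
# Rules must be replayed in learning order at tagging time.
rules = [
    # earlier rule: NN -> VB when the previous word is "to"
    (lambda i, s: s[i][1] == "NN" and s[i - 1][0] == "to", "VB"),
    # later rule: VB -> NN in the same context when the word is "school"
    (lambda i, s: s[i][0] == "school" and s[i][1] == "VB"
                  and s[i - 1][0] == "to", "NN"),
]

def tag(sentence):
    for matches, new_tag in rules:              # iterate rules in learned order
        for i in range(1, len(sentence)):
            if matches(i, sentence):
                sentence[i] = (sentence[i][0], new_tag)
    return sentence

print(tag([("to", "TO"), ("school", "NN")]))    # [('to', 'TO'), ('school', 'NN')]
```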
2. VBP → VB, PREV1OR2OR3TAG MD: might/MD vanish/VBP → VB
3. NN → VB, PREV1OR2TAG MD: might/MD not/RB reply/NN → VB
4. VB → NN, PREV1OR2TAG DT: the/DT great/JJ feast/VB → NN
Insights on TBL
TBL takes a long time to train, but is relatively fast at tagging once the rules are learned.
The rules in the sequence may be decomposed into noninteracting subsets, i.e., to focus only on VB tagging you need only look at the rules which affect it.
In cases where the data is sparse, the initial guess needs to be weak enough to allow for learning.
Rules become increasingly specific as you go down the sequence.
However, the more specific rules generally don't overfit because they cover just a few cases.
DT and TBL
DT is a subset of TBL: a decision tree can be expressed as a transformation list, e.g.:
Transformation list:
1. Label every char with S: A/S A/S A/S A/S A/S A/S A/S
2. If there is no previous character, then S → F: A/F A/S A/S A/S A/S A/S A/S
3. If the char two to the left is labeled with F, then S → F: A/F A/S A/F A/S A/F A/S A/F
4. If the char two to the left is labeled with F, then F → S: A/F A/S A/S A/S A/F A/S A/S
(F = yes, S = no; replayed in code below)
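Replaying that transformation list in code on the seven A's; note that transformation 4 reads labels already rewritten earlier in the same left-to-right pass, which is what produces the F S S S F S S result.

```python
# Replay the transformation list. Each pass scans left to right, so
# relabelings made earlier in a pass are visible later in the same pass.

labels = ["S"] * 7                        # 1. label every char with S

labels[0] = "F"                           # 2. no previous character: S -> F

for i in range(2, 7):                     # 3. char two to the left is F: S -> F
    if labels[i - 2] == "F" and labels[i] == "S":
        labels[i] = "F"
# now: F S F S F S F

for i in range(2, 7):                     # 4. char two to the left is F: F -> S
    if labels[i - 2] == "F" and labels[i] == "F":
        labels[i] = "S"
print(labels)                             # ['F', 'S', 'S', 'S', 'F', 'S', 'S']
```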
DT and TBL
TBL is more powerful than DT. The extra power of TBL comes from:
- transformations are applied in sequence
- the results of previous transformations are visible to the following transformations
Most likely tag: P(NN|race) = .98, P(VB|race) = .02
Initial tagging: Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
Rule template: change a word from tag X to tag Y when the previous tag is Z
Rule Instantiation for above example:
NN → VB PREV1OR2TAG TO
4. Pick the instantiation which results in the greatest error reduction, and add it to the output: VB → NN PREV1OR2TAG DT fixes 98 errors but produces 18 new errors, so the net decrease is 80 errors.
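A hedged sketch of applying the instantiated rule from above (NN → VB PREV1OR2TAG TO) to the baseline-tagged sentence; the `prev1or2tag` helper is improvised. Note the rule fixes race/NN but then also misfires on tomorrow/NN, since TO is still within two tags back: candidate rules can both fix and introduce errors, which is exactly why step 4 scores them by net reduction.

```python
# Apply NN -> VB when one of the one or two preceding tags is TO.

def prev1or2tag(tagged, i, tag):
    """True if either of the one or two preceding tags equals `tag`."""
    return any(t == tag for _, t in tagged[max(0, i - 2):i])

sentence = [("Is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
for i, (word, tag) in enumerate(sentence):
    if tag == "NN" and prev1or2tag(sentence, i, "TO"):
        sentence[i] = (word, "VB")
print(sentence)
# [('Is','VBZ'), ('expected','VBN'), ('to','TO'),
#  ('race','VB'), ('tomorrow','VB')]   <- race fixed, tomorrow newly wrong
```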