
Noun Phrase Extraction

An Evaluation and Description of Current Techniques

Whitney St. Charles
Department of Computer Science
The University of Tennessee at Chattanooga
615 McCallie Ave.
Chattanooga, TN 37403


Table of Contents

1 INTRODUCTION
2 BACKGROUND
   2.1 The Noun Phrase Extraction Task
3 Methods
   3.1 Static, Non-adaptable Parsers
   3.2 Machine Learning Techniques
      3.2.1 Transformation-Based Learners
      3.2.2 Memory-Based Learners
      3.2.3 Maximum Entropy Learners
      3.2.4 Hidden Markov Learners
      3.2.5 Conditional Random Field Learners
      3.2.6 Support Vector Machine Learners
4 Analysis and Conclusions
   4.1 Noun Phrase Extraction Accuracy Comparison
   4.2 Advantages and Disadvantages of Techniques
   4.3 Data Abstraction Comparison
Works Cited


1 INTRODUCTION

Natural Language Processing (NLP) is concerned with the automated, computer understanding of human language. Noun Phrase Extraction (NPE) is arguably one of the most critical components of this task. For example, information retrieval systems of all kinds rely on base noun phrases as their primary source of entity identification. Because this task is so necessary to natural language processing, a multitude of algorithms have been designed to handle it. I decided that a thorough review and comparison of the top seven techniques would be of great interest to computer scientists studying NLP.

The techniques I have selected represent seven completely distinct approaches to this task: static parsing, transformation-based learning, memory-based learning, maximum entropy, hidden Markov models, conditional random fields, and support vector machines. The latter six are machine learning techniques. Each of these is based on the premise that certain rules can be gleaned by training the learner on a particular data set. Given a particular English language corpus, or body of texts, a learner will develop a particular set of probabilities which will allow it to formulate its own understanding of the rules of the English language. This paper reviews each of these techniques, tested on a common data set and trained on a common English language corpus.


2 BACKGROUND

2.1 The Noun Phrase Extraction Task

Noun phrase extraction lies at the base of many areas in Natural Language Processing. So what makes this such an important subtask? To understand this, a deeper understanding of noun phrases and their syntactic significance is required.

A noun phrase is one whose head is a noun or pronoun, and may be optionally accompanied by a set of modifiers. Modifiers include determiners, adjectives, prepositional phrases, and relative clauses. Determiners are comprised of articles (a, the), demonstratives (this, those), numerals (one, two), possessives (his, her), and quantifiers (some, many). Examples of some noun phrases include:

the/DET red/ADJ ball/NN
the/DET books/NN [that I bought yesterday]/RC
the/DET man/NN [with the black hat]/PP

By identifying noun phrases, the primary entities in a sentence are isolated. Noun phrases contain the key actors in a sentence, such as "the man". Nouns intrinsically


identify what a particular piece of writing is about. For example, consider a piece of writing about airplane safety. It would likely contain verbs such as "read", "exit", or "notify". With only these words to look at, it would be difficult to guess what the text was about. However, the nouns in the piece might include "plane", "seatback", or "emergency". These are much more informative about the subject matter and context of the text, and therefore are quite valuable to the NLP task.

At first glance this may seem like a fairly simple task. English in particular has apparently unambiguous rules for defining noun phrases. The NLP subtask of Part-Of-Speech (POS) tagging has already been addressed with large success (Brill). Consequently, because the parts of speech of each word can be determined, the task of compounding these into noun phrases is a fairly simple one. Is this enough, however? Consider the following example: "The man whose red hat I borrowed yesterday in the street that is next to my house lives next door." Using the rules listed above, that sentence would be parsed as follows:

[The/DET man/NN [whose red hat [I borrowed yesterday]/RC]/RC [in the street]/PP [that is next to my house]/RC]/NP lives [next door]/NP.

The two noun phrases that are identified here are, in fact, correct, but they are not irreducible. By definition, noun phrases are extended by both relative clauses

(delineated here by RC) and prepositional phrases (PP). This parsing completely ignores the most base noun phrases contained in the sentence, however. A more useful parsing may be the following:

[[The man]/NP whose [red hat]/NP I borrowed [yesterday]/NP in [the street]/NP that is next to [my house]/NP]/NP lives [next door]/NP

All of the major entities in this sentence have now been identified. This is especially useful in the document clustering NLP task, for example, in which documents are clustered into meaningful categories which are indicative of subject matter. In order to achieve this grouping, the documents must be compared to one another in terms of their entity content. If the noun phrases from the first example were pulled from a particular document, the likelihood of finding another document with this exact same, long noun phrase would be very low. If extremely long, complex noun phrases are extracted, their uniqueness negates the chance of finding matching phrases against the other documents in a set. Each phrase simply contains too much information to be useful. This is one reason why irreducible or "base" noun phrases are much more desirable. There are two predominant methods for base noun phrase extraction: static, non-adaptable parsers and those based on machine learning techniques.

3 Methods

3.1 Static, Non-adaptable Parsers

Static parsers are the most uncomplicated and computationally inexpensive type. A set of static rules is defined to parse noun phrases from text. Often these rules are described in terms of a grammar or finite state automaton (FSA). For example, consider the rule: an article preceding a noun is included in the noun phrase. Consequently, this means that a noun after an article cannot mark the beginning of a noun phrase. The following example expresses this rule, as well as the general description listed above, in terms of a finite state automaton¹:

Figure 1: Simple FSA noun phrase identifier

¹ Note that this is an extremely simple, non-robust FSA, meant only to illustrate the concept. It is in no way a workable parser.


The initial state is S0, and as the phrase is traversed, transitions are made accordingly. For example, consider the phrase:

"the cranky man with the red hat that I know from school"

The determiner "the" allows us to take the first transition to S1. "Cranky" is an adjective; therefore we loop back and stay in S1. "Man" is a noun; therefore we transition into the accepting state and recognize that we have a noun phrase. "With the red hat" is a prepositional phrase and "that I know from school" is a relative clause; therefore we continue to stay in the accepting state, NP.

Once the initial grammatical rules have been formulated, there are several ways to realize the actual parser. A finite state automaton like the one above is usually expressed by using a regular grammar (Abney, "Partial Parsing via Finite-State Cascades"). For example, the FSA above could also be conveyed the following way:

S0 → d S1 | a S1 | n NP | p NP | d NP
S1 → a S1 | n NP | p NP
NP → rc NP | pp NP

Figure 2: A regular grammar expressing the FSA shown in Figure 1
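As an illustration, the following is a minimal sketch of this recognizer in Python. Like the automaton itself, it is a toy, not a workable parser; the grammar's nondeterministic d NP alternative is dropped to keep the machine deterministic.

```python
# Transition table for the Figure 1 automaton: (state, POS tag) -> next state.
TRANSITIONS = {
    ("S0", "d"): "S1", ("S0", "a"): "S1", ("S0", "n"): "NP", ("S0", "p"): "NP",
    ("S1", "a"): "S1", ("S1", "n"): "NP", ("S1", "p"): "NP",
    ("NP", "rc"): "NP", ("NP", "pp"): "NP",
}

def accepts(tags):
    state = "S0"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:      # no transition defined: reject the sequence
            return False
    return state == "NP"       # NP is the sole accepting state

# "the cranky man [with the red hat]PP [that I know from school]RC"
print(accepts(["d", "a", "n", "pp", "rc"]))  # True
print(accepts(["d", "d", "n"]))              # False
```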


In the regular grammar above, the set of nonterminals (V) consists of the set of FSA states: {S0, S1, NP}. The terminals in this case are the parts of speech, such as a, n, p, and d, which are adjectives, nouns, pronouns, and determiners respectively. To simplify this machine I have designated rc and pp as terminals, but assume that they correctly identify both relative clauses and prepositional phrases respectively. Steven Abney's technique, based on this principle, was able to achieve relatively high accuracy (Abney, "Partial Parsing via Finite-State Cascades").

Once a regular grammar like this one is formulated, another possible implementation involves an LR parser. An LR parser reads input from left to right, and is guided by rightmost derivations. Consider the following parse tree for the sentence "I eat cookies."
Productions:
1.) S → NP VP
2.) NP → Det N
3.) NP → N
4.) VP → V NP

Figure 3: LR Parse Tree and Context-Free Grammar


This parse tree is assembled by the LR parser from the bottom up, and derivations are made from right to left. "I" is the first input that is read; notice that it is not "cookies", because input is read from left to right, instead of right to left. However, at each step along the way, rightmost derivations are taken. When "I" is the only input read, it is of course the rightmost, and therefore production number 3 can be used. Then both "eat" and "cookies" are read before the next derivation can be made; now production 3, which takes care of "cookies", is chosen, because it is the rightmost. Now production 4 can be taken, which clumps "eat cookies" into a verb phrase. In this way, the parser can build the correct parse tree from the bottom up. A parser using these rules, and several others, was one of the early attempts at the noun phrase extraction task (Abney, "Parsing by Chunks").

The effectiveness of these techniques is entirely dependent upon the linguists formulating the rule set. With an extremely accurate set of rules, noun phrases will be correctly identified more of the time. However, while we learn grammar in school as a seemingly simple, unvarying set of rules, there are so many exceptions that the rules aren't as easily applicable as one might think. In fact, it is very difficult to formulate an accurate set of rules to define a noun phrase. The subjective nature of language means that there are simply too many unanticipated and undefined rules.
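As an illustration of the bottom-up assembly described above, here is a toy greedy shift-reduce sketch over the Figure 3 grammar. It is only a sketch: a real LR parser drives shifts and reductions from precomputed action and goto tables rather than reducing greedily.

```python
# Rule order matters for this greedy toy: NP -> Det N is tried before NP -> N.
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("Det", "N")),
           ("NP", ("N",)), ("VP", ("V", "NP"))]

def parse(tokens):
    stack, buf = [], list(tokens)
    while True:
        reduced = True
        while reduced:                       # reduce while any rule matches
            reduced = False
            for lhs, rhs in GRAMMAR:
                if len(stack) >= len(rhs) and tuple(stack[-len(rhs):]) == rhs:
                    stack[-len(rhs):] = [lhs]  # replace handle with nonterminal
                    reduced = True
                    break
        if buf:
            stack.append(buf.pop(0))         # shift the next input symbol
        else:
            return stack == ["S"]            # accept only a full sentence

print(parse(["N", "V", "N"]))         # "I eat cookies" -> True
print(parse(["Det", "N", "V", "N"]))  # "The man eats cookies" -> True
```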


Syntactic ambiguity is one such issue which can throw off static parsers. This occurs when a sentence can feasibly be interpreted in different ways. For example, consider the following two parse trees of the sentence "I saw the man with the telescope."

Figure 4: Two viable, alternative parse trees for the sentence

In the first tree, it is clear that I was looking through a telescope when I saw the man. The second parse tree shows that I saw a man who was holding a telescope. It is clear how these alternative interpretations affect the noun phrase parsing of the



sentence. In the first example, the noun phrase consists only of "the man", whereas in the second example, the phrase includes "with the telescope". Without more information about the context of this sentence, there would be no way to conclude which parsing is correct. A non-adaptable parser would simply pick one of these parse trees or the other every time it came across this particular configuration of parts of speech. Syntactic ambiguity like this occurs so often that it becomes a major failing point for static parsers.

3.2 Machine Learning Techniques

Machine learning methods aim to correct many of the shortcomings of the simple extractors. They have the ability to discover those unspoken rules of language. The more examples a learner sees, the more likely it is to discover new relationships. With this information, the learner is able to formulate and update rules. These techniques are based primarily on probability and statistics. Depending upon how well a given rule performs on a particular data set, certain weights are assigned. Learners can even address problems like structural ambiguity by discovering the context of a sentence. Given various POS configurations, a machine learner can formulate statistics about what is the most likely noun phrase parse tree.

Training is an integral part of any machine learner. In natural language processing, we often rely on a large, representative English language corpus on which to train.

This corpus contains a large number of English words as well as many different usages of those words. Sentences are tagged by hand with both parts of speech and noun phrase occurrences. In the computational linguistic sense, a corpus is a large, structured set of texts which is meant to be a representative sample of a particular genre or linguistic pattern (Sinclair). Often for the NPE task, it is most advantageous to train on an extremely general corpus, that is, one that is representative of the English language as a whole. Consider a corpus consisting only of Shakespeare's collected writings, for example. The sentence structures used by Shakespeare are rather unique; a learner trained on this corpus might perform very well when parsing Shakespeare, or perhaps even various other sixteenth-century poets. However, it might perform very poorly if applied to MySpace blogs, medical texts, or law briefs. This is because each of these genres contains large numbers of particular micro-linguistic features such as proper nouns or passive verb phrases (Sinclair). Therefore, a good English language corpus attempts to encompass all of the linguistic variations in appropriate proportions. Nearly all researchers of NPE recognize the Penn Treebank as the most appropriate and complete corpus for their purpose.

The Penn Treebank is an annotated corpus containing over 4.5 million English language words (Marcus, Santorini and Marcinkiewicz). It contains large selections from The Wall Street Journal. Noun phrases, prepositional phrases, and many other

grammatical patterns are tagged in the Penn Treebank. This enables the machine learners to be trained on known data. Because each of the algorithms discussed in this paper has been trained on the Penn Treebank, they are all examples of supervised learners. This means simply that they each have access to correct data, and learn by mapping correct inputs to desired outputs. An unsupervised learner would not have access to correct data, but instead would be forced to learn based on some kind of reward system. Unsupervised learners might be especially useful to scientists looking to discover new information about a particular topic, but they are inefficient and unneeded when it comes to tasks like grammar classification. Furthermore, these algorithms are all inductive, because they induce certain rules from correct examples. The following is a small excerpt from the Penn Treebank:

Figure 5: A small sample from the Penn Treebank

Once a learner has been appropriately trained, it may then be applied to an untagged version of a corpus selection, and then checked against that same, tagged portion for accuracy. "Untagged" in this case refers to text which has been tagged with POS, but not bracketed to indicate noun phrases. It is important that the learner be



tested on a corpus or selection that has not been part of the training set at any point. Otherwise, rules gleaned from that selection would be certain to ensure high accuracy on the test set. Though it is not a foregone conclusion that a learner trained on a particular portion of a corpus would have one hundred percent accuracy and recall on that same portion, it is still good scientific measure to test on other data sets, to negate the possibility.

3.2.1 Transformation-Based Learners

One of the first machine learning techniques applied to the NPE task was the transformation-based technique pioneered by Lance Ramshaw and Mitch Marcus in 1995 (Ramshaw and Marcus). Though this technique was originally introduced by Eric Brill for the POS tagging task (Brill), it has been a persistent and effective means of NPE parsing. Transformation-based learning is an error-driven approach. This means that it constantly generates new rules as it is being trained, and it then scores those rules individually. The algorithm which generates these is fairly simple:

Figure 6: Transformation-based learning algorithm


Though this algorithm may get fairly memory-intensive during the training period, it then produces a number of static rules which it will match to given POS sequences. Consider this algorithm applied to the following example:

Figure 7: Transformation-based learning example

Each of the first three rules shown in Figure 7 would most likely score quite highly. The last rule, however, is completely wrong. Nouns can easily come after verbs without being part of a noun phrase; they may instead be part of a verb phrase, for example. This rule will likely score poorly for this reason, and therefore it will be eliminated from the rule set. Notice also that the rules can encompass both words before and after the word in question. In this way, machine learners can handle context. The transformation-based parser by Ramshaw and Marcus can look up to two tags before and two tags after the current word in order to formulate rules.
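What follows is a minimal sketch of this error-driven scoring loop. The mini-corpus and the crude "POS tag to chunk label" rule template are hypothetical and far simpler than Ramshaw and Marcus's actual templates, which also condition on surrounding tags.

```python
# Hypothetical mini-corpus of (POS tag, gold chunk label) pairs, B/I/O labels.
train = [("DET", "B"), ("ADJ", "I"), ("NN", "I"),
         ("V", "O"), ("NN", "B"), ("DET", "B"), ("NN", "I")]

current = ["O"] * len(train)  # baseline guess: label every token Outside

def score(rule):
    """Net improvement if rule (POS -> new label) were applied everywhere."""
    pos, new = rule
    fixed = broken = 0
    for (p, gold), cur in zip(train, current):
        if p == pos:
            if new == gold and cur != gold:
                fixed += 1      # the rule corrects a current error
            elif cur == gold and new != gold:
                broken += 1     # the rule damages a correct label
    return fixed - broken

candidates = [(p, l) for p in ("DET", "ADJ", "NN", "V") for l in "BIO"]
while True:
    best = max(candidates, key=score)
    gain = score(best)
    if gain <= 0:
        break                   # no rule improves accuracy; stop training
    print("learned rule:", best, "gain:", gain)
    current = [best[1] if p == best[0] else c
               for (p, _), c in zip(train, current)]
```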


3.2.2 Memory-Based Learners

Memory-based learners classify data according to similarities to other data observed earlier. This is similar in some ways to how a human or an animal might make decisions about an unknown quantity. For example, if I were to observe a glowing red coil on a stovetop without ever having seen one before, I might extrapolate that it is hot based on its visual similarity to other hot things I have observed in the past, such as the glowing embers in a fire. A memory-based learner does this to an extreme degree; that is, it remembers everything that it has ever encountered. This property, termed "lazy learning", is unique to memory-based learners, as most other techniques remember only a small subset of the configurations they might encounter. Machine learning techniques which abstract knowledge from the training data and forget everything else are known as "greedy" learners (Daelemans, Buchholz and Veenstra).


Figure 8: Memory-based learning algorithm

The concept of determining distance between varying grammatical configurations can be difficult. Though there are several complex techniques for accomplishing this, the method used by the NPE parsers explored in this paper is an overlap function; this function simply calculates the number of overlapping features that an instance has with another instance occurring in memory. For example, consider the following candidate sequence:

the/DET beautiful/ADJ, talented/ADJ college/NN student/NN

The memory-based learner looks to its rule set and determines that the following three sequences have the highest number of overlapping features, and therefore they are added to the set of nearest neighbors:


| Sequence | Classification |
| --- | --- |
| DET ADJ NN NN | NP |
| DET ADJ ADJ NN NNP | NP |
| V DET ADJ NN NN | Not an NP |

Table 1: Nearest Neighbor List
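A minimal sketch of this overlap count and nearest-neighbor vote, with hypothetical memory contents mirroring Table 1:

```python
from collections import Counter

# Everything the lazy learner has seen so far, with its classification.
memory = [(("DET", "ADJ", "NN", "NN"), "NP"),
          (("DET", "ADJ", "ADJ", "NN", "NNP"), "NP"),
          (("V", "DET", "ADJ", "NN", "NN"), "not NP")]

def overlap(a, b):
    # Count positions where the two tag sequences agree.
    return sum(x == y for x, y in zip(a, b))

def classify(candidate, k=3):
    neighbors = sorted(memory, key=lambda m: overlap(candidate, m[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # majority vote of the k neighbors

print(classify(("DET", "ADJ", "ADJ", "NN", "NN")))  # -> 'NP'
```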


The majority of the nearest neighbors are classified as noun phrases; therefore the new configuration (DET ADJ ADJ NN NN) is classified as a noun phrase also, and added into memory.

3.2.3 Maximum Entropy Learners

Maximum entropy learners use a statistical method to predict which possibility is the most likely, that is, whether the current word begins, falls within, or falls outside a noun phrase. They work on the assumption that the probability distribution which maximizes information entropy, the measure of uncertainty associated with a random variable, is also the least biased. Consider that we have m unique propositions. The most informative distribution is one in which we know one of the propositions is true; information entropy is 0. The least informative distribution is one in which there is no reason to favor any one proposition over another; information entropy is log m (Berger, Della Pietra and Della Pietra). To better understand this principle, take into account the following example:

The tag configuration (DET ADJ NN) begins a noun phrase some of the time (B), falls in the middle of an existing noun phrase some of the time (I), and falls outside a noun phrase the remaining time (O).

P(B) + P(I) + P(O) = 1

To begin with, the learner assumes that each of these probabilities is evenly distributed:

P(B) = P(I) = P(O) = 0.33

Now suppose the learner discovers that this phrase either begins or ends a noun phrase half of the time:

P(B) = P(O) = 0.25, P(I) = 0.5

With each new learned constraint, the weights are redistributed to create the flattest possible probability distribution. This technique is particularly simple to implement, given that there are general tools that have been created to apply this algorithm, such as Maccent (Koeling).
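A minimal sketch of this redistribution, using scipy to maximize entropy under the learned constraint; the toy numbers match the example above, and the numerical optimizer is an assumed stand-in for the closed-form reweighting a real maximum entropy toolkit performs.

```python
import numpy as np
from scipy.optimize import minimize

labels = ["B", "I", "O"]   # begins / inside / outside a noun phrase

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)       # guard against log(0)
    return np.sum(p * np.log(p))     # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},        # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},    # learned: P(B)+P(O)=0.5
]

result = minimize(neg_entropy, x0=np.full(3, 1 / 3),
                  bounds=[(0, 1)] * 3, constraints=constraints)
print(dict(zip(labels, result.x.round(2))))  # {'B': 0.25, 'I': 0.5, 'O': 0.25}
```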


3.2.4 Hidden Markov Learners

These learners are based on the Hidden Markov Model (HMM). This statistical model can be applied as long as the system being modeled possesses the hidden Markov property. A system with the hidden Markov property must have a discrete number of possible states, and the probability distribution of future states must depend only on the present state and be completely independent of past states. These states are not directly observable. Consider the following Markov system (Wikipedia):

You have a friend who lives far away. You call him daily to hear what he did that day, which is one of three activities: walking, cleaning, or shopping. You know that his activities are linked to the weather, which could be in one of two states, rainy or sunny. Though you cannot directly observe the weather, you can attempt to guess what it may have been like, given the activities that your friend did that day. In this system, the weather possesses the hidden Markov property. This system possesses certain transition probabilities; for example, the likelihood that it will continue to rain if the current condition is Rainy is 70%, while the chance that it will suddenly become sunny is only 30%. You also know that there are certain emission probabilities; if it is sunny, you know that there is a 60% chance he will walk, a 30% chance he will shop, and only a 10% chance that he will clean. This system is fully described below.


states = {Rainy, Sunny}

observations = {walk, shop, clean}

start_probability = {Rainy: 0.6, Sunny: 0.4}

transition_probability = {
    Rainy: {Rainy: 0.7, Sunny: 0.3},
    Sunny: {Rainy: 0.4, Sunny: 0.6}}

emission_probability = {
    Rainy: {walk: 0.1, shop: 0.4, clean: 0.5},
    Sunny: {walk: 0.6, shop: 0.3, clean: 0.1}}

Figure 9: Hidden Markov Model Probabilistic Parameters
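Given these parameters, the most likely hidden weather sequence for a run of observed activities can be recovered with the Viterbi algorithm. A compact sketch, with values mirroring Figure 9:

```python
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((p * trans_p[prev][s] * emit_p[s][o], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

prob, path = viterbi(["walk", "shop", "clean"])
print(path, round(prob, 4))  # ['Sunny', 'Rainy', 'Rainy'] 0.0134
```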

In the case of noun phrase extraction, the hidden property is the unknown grammar rule. Observations are formed by the training data. Contextual probabilities represent the transition states; that is, given our previous two transitions, what is the likelihood of continuing, ending, or beginning a noun phrase: P(o_i | o_{i-1}, o_{i-2}).



Emission or output probabilities are the chances that the current word will begin, continue, or end a noun phrase given the current state.

3.2.5 Conditional Random Field Learners

Conditional Random Field (CRF) machine learners are actually quite similar to hidden Markov learners. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs. The transition probabilities of the HMM have been transformed into feature functions that are conditional upon the input sequence. In the following example, each Y_i could be B, I, or O:

Figure 10: CRF Probabilistic Parameters

In this undirected graphical model, each vertex represents a random variable whose distribution is to be inferred, and each edge represents a dependency between two random variables. In a CRF, the distribution of each discrete random variable Y in the graph is conditioned on an input sequence X.
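As a sketch of how such a model can be fit in practice, the example below uses the third-party sklearn-crfsuite package; this is an assumed stand-in, not the toolkit used by Sha and Pereira, and the tiny training sequence is hypothetical. The feature dictionaries condition each label on the surrounding tag window, which is the conditional behavior described above.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def features(tags, i):
    # Describe token i by its own POS tag and its neighbors', so the label
    # distribution is conditioned on the observed input sequence X.
    f = {"pos": tags[i]}
    if i > 0:
        f["prev_pos"] = tags[i - 1]
    if i < len(tags) - 1:
        f["next_pos"] = tags[i + 1]
    return f

sentences = [["DET", "ADJ", "NN", "V", "NN"]]
X = [[features(s, i) for i in range(len(s))] for s in sentences]
y = [["B", "I", "I", "O", "B"]]  # gold B/I/O chunk labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # [['B', 'I', 'I', 'O', 'B']]
```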

3.2.6 Support Vector Machine Learners

Support vector machines (SVMs) are the most complex machine learning technique explored in this paper; however, they are also the most accurate and computationally inexpensive. SVMs are binary classifiers. This means that they are best used to separate one class of items from another. In the NPE case, they can be used to separate noun phrases from non-noun phrases. This is done by representing each of the possible grammatical configurations, such as (NN P V), in the set as an ordered pair, or ordered n-tuple. This configuration is then mapped into an n-dimensional space, where n is the number of features that the configurations should be compared on (such as length or amount of overlap). As it is mapped, it is also assigned a color, such as red or blue, which represents whether or not the configuration is a noun phrase. Once this mapping has been done for all items in the training set, the system attempts to separate the red items from the blue ones using an (n-1)-dimensional hyperplane. Simply put, this means that if the system graphs into two-dimensional space, the features are separated by a line; if the system graphs into a three-dimensional space, the features are separated by a plane. Figure 11 illustrates support vectors in 2-dimensional space. It should be noted that


an SVM will always attempt to maximize the margin between the two classes of points to ensure the most generalization.

Figure 11: Support Vectors in 2-dimensional space (Kudo and Matsumoto, "Chunking with Support Vector Machines")

Once a graph of the feature space is known, it can then be applied to new data. As new phrases are read into the SVM, they are plotted accordingly. The phrases must fall on one side of the line or the other, thereby classifying them either as noun phrases or not. Though an SVM can handle literally millions of configurations, it is still one of the least computationally expensive techniques. This is because each configuration is simply recognized as an ordered pair. There is no lookup which takes place during the parsing of the text, either. Each phrase is simply mapped into the space and a yes-or-no answer is immediately delivered.
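A minimal sketch of this mapping-and-separation idea using scikit-learn's SVC; the tag-window features and toy labels are hypothetical, not Kudo and Matsumoto's actual setup.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# Hypothetical training configurations: a window of POS tags, labeled
# 1 if the window is a noun phrase and 0 if it is not.
train = [({"t0": "DET", "t1": "ADJ", "t2": "NN"}, 1),
         ({"t0": "DET", "t1": "NN", "t2": "NN"}, 1),
         ({"t0": "V", "t1": "DET", "t2": "NN"}, 0),
         ({"t0": "V", "t1": "ADV", "t2": "V"}, 0)]

vec = DictVectorizer()                        # maps configurations to points
X = vec.fit_transform([f for f, _ in train])  # in n-dimensional feature space
y = [label for _, label in train]

clf = SVC(kernel="linear").fit(X, y)          # max-margin separating hyperplane

# Classifying a new phrase is a single mapping plus a side-of-the-plane test.
print(clf.predict(vec.transform([{"t0": "DET", "t1": "ADJ", "t2": "NN"}])))
```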


4 Analysis and Conclusions

Each of these techniques was trained and tested on a common data set for the CoNLL (Conference on Natural Language Learning) 2000 shared task. This data set is a selection from the Penn Treebank. The performance metric used is F-measure. This measure tests both the precision and recall of a given technique. In this equation, β refers to the weight of precision to recall, which is always 1 for the experiments explored by this paper.

Figure 12: Definitions for Precision, Recall, and F-Measure
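For reference, the standard CoNLL-style definitions behind Figure 12 are as follows; the figure's exact notation may differ:

```latex
\text{precision} = \frac{|\text{correct NPs found}|}{|\text{NPs found}|},
\qquad
\text{recall} = \frac{|\text{correct NPs found}|}{|\text{NPs in the gold data}|},
\qquad
F_{\beta} = \frac{(\beta^{2}+1)\cdot\text{precision}\cdot\text{recall}}
                 {\beta^{2}\cdot\text{precision}+\text{recall}}
```

With β = 1, as in all experiments here, the last expression reduces to the harmonic mean 2·precision·recall / (precision + recall).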

4.1 Noun Phrase Extraction Accuracy Comparison

The following table contains a chronological listing of the foremost implementations of the techniques explored by this paper.

| Primary Work | Year | Parsing Method | Implementation Title | Language | Performance Measure |
| --- | --- | --- | --- | --- | --- |
| Marc Vilain, David Day (Vilain and Day) | 1995 | Static, Rule-based | Alembic | LISP | 77 |
| Lance Ramshaw, Mitch Marcus (Ramshaw and Marcus) | 1995 | Transformation-based | Marc Greenwood Chunker | Java | 92.03 |
| Rob Koeling (Koeling) | 2000 | Maximum Entropy | MaxEnt | Java | 91.97 |
| Taku Kudoh, Yuji Matsumoto (Kudo and Matsumoto, "Chunking with Support Vector Machines") | 2001 | Support Vector Machine | YamCha | Perl, C++, Python | 94.39 |
| Erik Tjong Kim Sang (Tjong Kim Sang) | 2002 | Memory-based | TiMBL | Python | 92.5 |
| Antonio Molina, Ferran Pla (Molina and Pla) | 2002 | Hidden Markov Model | N/A | N/A | 92.19 |
| Fei Sha, Fernando Pereira (Sha and Pereira) | 2003 | Conditional Random Fields | CRF++² | C++ | 94.38 |

Table 2: NPE Accuracy Comparison

² Note that this implementation was authored by Taku Kudoh and Yuji Matsumoto.


4.2 Advantages and Disadvantages of Techniques

| Parsing Technique | Advantages | Disadvantages |
| --- | --- | --- |
| Non-adaptable | Extremely simple to implement. Does not require a training corpus. | Rules are not sufficiently complete; they cannot predict the subtleties and exceptions that occur naturally in language. Very difficult to generate new rules without the skills of an accomplished linguist. |
| Transformation-based | The first machine learning technique to be applied; therefore, there are several well-done implementations. | Memory intensive; requires large amounts of storage for rule sequences. Computationally complex; constantly probing the database for appropriate rules to apply. |
| Memory-based | Well suited to the NLP task; achieves high levels of accuracy. Relatively simple technique to implement. | Memory intensive; stores every grammatical configuration that it encounters. Fails to recognize feature dependencies and therefore stores many more rules than are necessary. No ability to weight important features; indiscriminately stores and applies all configurations; may prefer an anomalous configuration to an established one. |
| Maximum Entropy | Very easy to adapt and implement; because it is a very generalized statistical technique, there are several established tools outside of NLP which can be adapted. | Does not take into account the context of the phrase; examines only the current word and makes a local decision. |
| Hidden Markov Model | Takes position in the sentence into account; recognizes where in a sentence noun phrases are more likely to occur. | Makes independence assumptions which cause it to ignore special input features such as suffixes, capitalization, and context. |
| Conditional Random Fields | Improved version of an HMM; CRFs do take context into account and therefore achieve a much higher accuracy. | Can overfit the training data (Smith); the model can become too specific to the training data and take performance hits (however, this effect can be mitigated by carefully controlling the individual weighting factors). |
| Support Vector Machines | Achieves the highest level of accuracy. Very quick runtimes, because it takes very few operations to plot a configuration in the solution space; then the answer is immediately known. | Can be a very complex system to implement. |

4.3 Data Abstraction Comparison

Another way to evaluate each of the machine learning techniques is in terms of their data abstraction. A technique which does not exhibit data abstraction is extremely detail-driven; it overfits the test data set, creating rules which might perform very well on similar sets, but poorly on a more general set. Consider training a tool like this on a medical corpus. It may become very good at identifying the exact kinds of sentence structures that occur in this type of text, but it would then perform very badly if applied to The Wall Street Journal, for example. A tool with significant data abstraction, however, may be able to extrapolate more general rules which may perform better under this kind of circumstance. For this reason, this seems to be an indicator of how much the machine learner has really learned; the tool is able to make more educated guesses, rather than just regurgitating past instances that it

has encountered. I think that Daelemans et al. said it well when they termed their own memory-based parser a "lazy learner". The techniques which are less abstract will ultimately be outperformed on unknown data sets or data sets of a differing genre. Ultimately, the more abstract the technique is, the more robust it is; that is, the more capable it is of handling errors. The following illustration represents my approximation:


Figure 13: Data Abstraction Diagram

Appendix I contains a complete classification of all noun phrase extractors I could find, to date. A glossary of all abbreviations used in this paper can be found in Appendix II.


Works Cited
Abney, Steven. "Parsing by Chunks." Berwick, Robert, Steven Abney and Carol Tenny. Principle-Based Parsing. Dordrecht: Kluwer Academic Publishers, 1991. 1-18.

—. "Partial Parsing via Finite-State Cascades." Workshop on Robust Parsing, 8th European Summer School in Logic, Language and Information. Prague, 1996. 8-15.

ACL SIGLEX. ACL SIGLEX Resource Links. 13 March 2008. Special Interest Group on the Lexicon of the Association for Computational Linguistics. 20 March 2008 <http://www.clres.com/corp.html>.

Argamon, Shlomo, Ido Dagan and Yuval Krymolowski. "A Memory-Based Approach to Learning Shallow Natural Language Patterns." 17th International Conference on Computational Linguistics. Montreal, Quebec: Association for Computational Linguistics, 1998. 67-73.

Berger, Adam, Vincent Della Pietra and Stephen Della Pietra. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics March 1996: 39-71.

Bourigault, Didier. "Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases." 14th Conference on Computational Linguistics. Nantes, France: Association for Computational Linguistics, 1992. 977-981.

Brill, Eric. "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging." Computational Linguistics 1995: 543-565.


Buckley, Chris, et al. "The Smart/Empire TIPSTER IR System." Association for Computational Linguistics Workshop. Baltimore, Maryland: Association for Computational Linguistics, 1998. 107-121.

Cardie, Claire and David Pierce. "Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification." 17th International Conference on Computational Linguistics. Montreal, Quebec, Canada: Association for Computational Linguistics, 1998. 218-224.

Chen, Kuang-hua and Hsin-Hsi Chen. "Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and its Automatic Evaluation." 32nd Annual Meeting on Association for Computational Linguistics. Las Cruces, New Mexico: Association for Computational Linguistics, 1994. 234-241.

Cheng, D., et al. "A Divide-and-Merge Methodology for Clustering." 24th ACM International Conference on Principles of Database Systems. 2005. 196-204.

Church, Kenneth. "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text." Second Conference on Applied Natural Language Processing. Austin, Texas: Association for Computational Linguistics, 1988. 136-143.

Daelemans, Walter, Sabine Buchholz and Jorn Veenstra. "Memory-Based Shallow Parsing." Conference on Natural Language Learning, 1999.

Dejean, Herve. "Learning Syntactic Structures with XML." 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Lisbon, Portugal: Association for Computational Linguistics, 2000. 133-135.

Dy, Jennifer and Carla Brodley. "Feature Selection for Unsupervised Learning." The Journal of Machine Learning Research 5 (2004): 845-889.

Hindle, Donald and Mats Rooth. "Structural Ambiguity and Lexical Relations." Computational Linguistics: Special Issue on Using Large Corpora March 1993: 103-120.

Hindle, Donald. "Noun Classification from Predicate-Argument Structures." 28th Annual Meeting on Association for Computational Linguistics. Pittsburgh, Pennsylvania: Association for Computational Linguistics, 1990. 268-275.

Joho, Hideo and Mark Sanderson. "Retrieving Descriptive Phrases from Large Amounts of Free Text." Ninth International Conference on Information and Knowledge Management. McLean, Virginia: ACM Press, 2000. 180-186.

Koeling, Rob. "Chunking with Maximum Entropy Models." 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Lisbon, Portugal: Association for Computational Linguistics, 2000. 139-141.

Kudo, Taku and Yuji Matsumoto. "Chunking with Support Vector Machines." 2nd Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Pittsburgh, Pennsylvania: Association for Computational Linguistics, 2001. 1-8.

—. "Use of Support Vector Learning for Chunk Identification." 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Lisbon, Portugal: Association for Computational Linguistics, 2000. 142-144.

Kumaran, Giridhar and James Allan. "Using Names and Topics for New Event Detection." Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Vancouver, British Columbia, Canada: Association for Computational Linguistics, 2005. 121-128.

Liu, Huan and Hiroshi Motoda. Feature Selection for Knowledge Discovery and Data Mining. Norwell, MA: Kluwer Academic Publishers, 1998.

Marcus, Mitchell P., Beatrice Santorini and Mary Ann Marcinkiewicz. "Building a Large Annotated Corpus of English: The Penn Treebank." Computational Linguistics. 1993. 313-330.

Meng Soon, Wee, Hwee Tou Ng and Daniel Chung Yong Lim. "A Machine Learning Approach to Coreference Resolution of Noun Phrases." Computational Linguistics: Special Issue on Computational Anaphora Resolution December 2001: 521-544.

Molina, Antonio and Ferran Pla. "Shallow Parsing Using Specialized HMMs." Journal of Machine Learning Research (2002): 595-613.

Ramshaw, Lance and Mitch Marcus. "Text Chunking Using Transformation-Based Learning." Third Workshop on Very Large Corpora. Ed. David Yarovsky and Kenneth Church. Somerset, New Jersey: Association for Computational Linguistics, 1995. 82-94.

Sha, Fei and Fernando Pereira. "Shallow Parsing with Conditional Random Fields." Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Edmonton, Canada: Association for Computational Linguistics, 2003. 134-141.

Sinclair, John. Developing Linguistic Corpora: a Guide to Good Practice. 2004. Arts and Humanities Data Service. 20 March 2008 <http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter1.htm>.

Smith, Andrew. Logarithmic Opinion Pools for Conditional Random Fields. 26 June 2007. The University of Edinburgh. 23 March 2008 <http://www.era.lib.ed.ac.uk/handle/1842/1730>.

Taylor, Lita, Claire Grover and Ted Briscoe. "The Syntactic Regularity of English Noun Phrases." Fourth Conference on European Chapter of the Association for Computational Linguistics. Manchester, England: Association for Computational Linguistics, 1989. 256-263.

Tjong Kim Sang, Erik. "Memory-Based Shallow Parsing." The Journal of Machine Learning Research 2 (2002): 559-594.

University of Pennsylvania. The Penn Treebank Project. 2 February 1999. Penn Engineering. 21 March 2008 <http://www.cis.upenn.edu/~treebank/>.

Vilain, Marc and David Day. "Phrase Parsing with Rule Sequence Processors: an Application to the Shared CoNLL Task." Proceedings of the Fourth Conference on Computational Natural Language Learning and of the Second Learning Language in Logic Workshop. Lisbon: Association for Computational Linguistics, 2000.

Wikipedia. Hidden Markov Model. 12 March 2008. 22 March 2008 <http://en.wikipedia.org/wiki/Hidden_Markov_model>.

Appendix I
Classification of Current Implementations

Appendix II
Glossary of Abbreviations¹

| Abbreviation | Meaning |
| --- | --- |
| NPE | Noun Phrase Extraction |
| NLP | Natural Language Processing |
| DET/d | Determiner |
| ADJ/a | Adjective |
| NN/n | Noun |
| RC/rc | Relative clause |
| PP/pp | Prepositional phrase |
| V | Verb |
| VP | Verb Phrase |
| S | Sentence |
| POS | Part Of Speech |
| FSA | Finite State Automaton |
| NP | Noun Phrase |
| LR | Left-Right (scans input from Left to right, guided by Rightmost derivations) |
| B | Beginning a noun phrase |
| I | Inside of a noun phrase |
| O | Outside of a noun phrase |
| HMM | Hidden Markov Model |
| CRF | Conditional Random Field |
| SVM | Support Vector Machine |

¹ Entries in this glossary are made in the order in which they appear.
