
Noun Phrase Extraction

An Evaluation and Description of Current Techniques

Whitney St. Charles
Department of Computer Science
The University of Tennessee at Chattanooga
615 McCallie Ave.
Chattanooga, TN 37403


Table of Contents

1 INTRODUCTION
2 BACKGROUND
   2.1 The Noun Phrase Extraction Task
3 Methods
   3.1 Static, Non-adaptable Parsers
   3.2 Machine Learning Techniques
      3.2.1 Transformation-Based Learners
      3.2.2 Memory-Based Learners
      3.2.3 Maximum Entropy Learners
      3.2.4 Hidden Markov Learners
      3.2.5 Conditional Random Field Learners
      3.2.6 Support Vector Machine Learners
4 Analysis and Conclusions
   4.1 Noun Phrase Extraction Accuracy Comparison
   4.2 Advantages and Disadvantages of Techniques
   4.3 Data Abstraction Comparison
Works Cited


1 INTRODUCTION

Natural Language Processing (NLP) is concerned with the automated, computer understanding of human language. Noun Phrase Extraction (NPE) is arguably one of the most critical components of this task. For example, information retrieval systems of all kinds rely on base noun phrases as their primary source of entity identification. Because this task is so necessary to natural language processing, a multitude of algorithms have been designed to handle it. I decided that a thorough review and comparison of the top seven techniques would be of great interest to computer scientists studying NLP.

The techniques I have selected represent seven completely distinct approaches to this task: static parsing, transformation-based learning, memory-based learning, maximum entropy, hidden Markov models, conditional random fields, and support vector machines. The latter six are machine learning techniques. Each of these is based on the premise that certain rules can be gleaned by training the learner on a particular data set. Given a particular English language corpus, or body of texts, a learner will develop a particular set of probabilities which will allow it to formulate its own understanding of the rules of the English language. This paper reviews each of these techniques, tested on a common data set and trained on a common English language corpus.


2 BACKGROUND

2.1 The Noun Phrase Extraction Task

Noun phrase extraction lies at the base of many areas in Natural Language Processing. So what makes this such an important subtask? To understand this, a deeper understanding of noun phrases and their syntactic significance is required.

A noun phrase is one whose head is a noun or pronoun, and may be optionally accompanied by a set of modifiers. Modifiers include determiners, adjectives, prepositional phrases, and relative clauses. Determiners are comprised of articles (a, the), demonstratives (this, those), numerals (one, two), possessives (his, her), and quantifiers (some, many). Examples of some noun phrases include:

the/DET red/ADJ ball/NN
the/DET books/NN [that I bought yesterday]/RC
the/DET man/NN [with the black hat]/PP

By identifying noun phrases, the primary entities in a sentence are isolated. Noun phrases contain the key actors in a sentence, such as "the man". Nouns intrinsically


identify what a particular piece of writing is about. For example, consider a piece of writing about airplane safety. It would likely contain verbs such as "read", "exit", or "notify". With only these words to look at, it would be difficult to guess what the text was about. However, the nouns in the piece might include "plane", "seatback", or "emergency". These are much more informative about the subject matter and context of the text, and therefore are quite valuable to the NLP task.

At first glance this may seem like a fairly simple task. English in particular has apparently unambiguous rules for defining noun phrases. The NLP subtask of Part-Of-Speech (POS) tagging has already been addressed with large success (Brill). Consequently, because the parts of speech of each word can be determined, the task of compounding these into noun phrases is a fairly simple one. Is this enough, however? Consider the following example: "The man whose red hat I borrowed yesterday in the street that is next to my house lives next door." Using the rules listed above, that sentence would be parsed as follows:

[The/DET man/NN [whose red hat [I borrowed yesterday]/RC]/RC [in the street]/PP [that is next to my house]/RC]/NP lives [next door]/NP.

The two noun phrases that are identified here are, in fact, correct, but they are not irreducible. By definition, noun phrases are extended by both relative clauses

(delineated here by RC) and prepositional phrases (PP). This parsing completely ignores the most base noun phrases contained in the sentence, however. A more useful parsing may be the following:

[[The man]/NP whose [red hat]/NP I borrowed [yesterday]/NP in [the street]/NP that is next to [my house]/NP]/NP lives [next door]/NP

All of the major entities in this sentence have now been identified. This is especially useful in the document clustering NLP task, for example, in which documents are clustered into meaningful categories which are indicative of subject matter. In order to achieve this grouping, the documents must be compared to one another in terms of their entity content. If the noun phrases from the first example were pulled from a particular document, the likelihood of finding another document with this exact same, long noun phrase would be very low. If extremely long, complex noun phrases are extracted, their uniqueness negates the chance of finding matching phrases against the other documents in a set. Each phrase simply contains too much information to be useful. This is one reason why irreducible or "base" noun phrases are much more desirable. There are two predominant methods for base noun phrase extraction: static, non-adaptable parsers and those based on machine learning techniques.

3 Methods

3.1 Static, Non-adaptable Parsers

Static parsers are the most uncomplicated and computationally inexpensive type. A set of static rules is defined to parse noun phrases from text. Often these rules are described in terms of a grammar or finite state automaton (FSA). For example, consider the rule: an article preceding a noun is included in the noun phrase. Consequently, this means that a noun after an article cannot mark the beginning of a noun phrase. The following example expresses this rule, as well as the general description listed above, in terms of a finite state automaton¹:

Figure 1: Simple FSA noun phrase identifier

¹ Note that this is an extremely simple, non-robust FSA, meant only to illustrate the concept. It is in no way a workable parser.


The initial state is S0, and as the phrase is traversed, transitions are made accordingly. For example, consider the phrase:

"the cranky man with the red hat that I know from school"

The determiner "the" allows us to take the first transition to S1. "Cranky" is an adjective; therefore we loop back and stay in S1. "Man" is a noun; therefore we transition into the accepting state and recognize that we have a noun phrase. "With the red hat" is a prepositional phrase and "that I know from school" is a relative clause; therefore we continue to stay in the accepting state, NP.

Once the initial grammatical rules have been formulated, there are several ways to realize the actual parser. A finite state automaton like the one above is usually expressed by using a regular grammar (Abney, "Partial Parsing via Finite-State Cascades"). For example, the FSA above could also be conveyed the following way:

S0 → d S1 | a S1 | n NP | p NP | d NP
S1 → a S1 | n NP | p NP
NP → rc NP | pp NP

Figure 2: A regular grammar expressing the FSA shown in Figure 1
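As an illustration, the following is a minimal sketch of this recognizer in Python. Like the automaton itself, it is a toy, not a workable parser; the grammar's nondeterministic d NP alternative is dropped to keep the machine deterministic.

```python
# Transition table for the Figure 1 automaton: (state, POS tag) -> next state.
TRANSITIONS = {
    ("S0", "d"): "S1", ("S0", "a"): "S1", ("S0", "n"): "NP", ("S0", "p"): "NP",
    ("S1", "a"): "S1", ("S1", "n"): "NP", ("S1", "p"): "NP",
    ("NP", "rc"): "NP", ("NP", "pp"): "NP",
}

def accepts(tags):
    state = "S0"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:      # no transition defined: reject the sequence
            return False
    return state == "NP"       # NP is the sole accepting state

# "the cranky man [with the red hat]PP [that I know from school]RC"
print(accepts(["d", "a", "n", "pp", "rc"]))  # True
print(accepts(["d", "d", "n"]))              # False
```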


In the regular grammar above, the set of nonterminals (V) consists of the set of FSA states: {S0, S1, NP}. The terminals in this case are the parts of speech, such as a, n, p, and d, which are adjectives, nouns, pronouns, and determiners respectively. To simplify this machine I have designated rc and pp as terminals, but assume that they correctly identify both relative clauses and prepositional phrases respectively. Steven Abney's technique, based on this principle, was able to achieve relatively high accuracy (Abney, "Partial Parsing via Finite-State Cascades").

Once a regular grammar like this one is formulated, another possible implementation involves an LR parser. An LR parser reads input from left to right, and is guided by rightmost derivations. Consider the following parse tree for the sentence "I eat cookies."
Productions:
1.) S → NP VP
2.) NP → Det N
3.) NP → N
4.) VP → V NP

Figure 3: LR Parse Tree and Context-Free Grammar


This parse tree is assembled by the LR parser from the bottom up, and derivations are made from right to left. "I" is the first input that is read; notice that it is not "cookies", because input is read from left to right, instead of right to left. However, at each step along the way, rightmost derivations are taken. When "I" is the only input read, it is of course the rightmost, and therefore production number 3 can be used. Then both "eat" and "cookies" are read before the next derivation can be made; now production 3, which takes care of "cookies", is chosen, because it is the rightmost. Now production 4 can be taken, which clumps "eat cookies" into a verb phrase. In this way, the parser can build the correct parse tree from the bottom up. A parser using these rules, and several others, was one of the early attempts at the noun phrase extraction task (Abney, "Parsing by Chunks").

The effectiveness of these techniques is entirely dependent upon the linguists formulating the rule set. With an extremely accurate set of rules, noun phrases will be correctly identified more of the time. However, while we learn grammar in school as a seemingly simple, unvarying set of rules, there are so many exceptions that the rules aren't as easily applicable as one might think. In fact, it is very difficult to formulate an accurate set of rules to define a noun phrase. The subjective nature of language means that there are simply too many unanticipated and undefined rules.
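As an illustration of the bottom-up assembly described above, here is a toy greedy shift-reduce sketch over the Figure 3 grammar. It is only a sketch: a real LR parser drives shifts and reductions from precomputed action and goto tables rather than reducing greedily.

```python
# Rule order matters for this greedy toy: NP -> Det N is tried before NP -> N.
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("Det", "N")),
           ("NP", ("N",)), ("VP", ("V", "NP"))]

def parse(tokens):
    stack, buf = [], list(tokens)
    while True:
        reduced = True
        while reduced:                       # reduce while any rule matches
            reduced = False
            for lhs, rhs in GRAMMAR:
                if len(stack) >= len(rhs) and tuple(stack[-len(rhs):]) == rhs:
                    stack[-len(rhs):] = [lhs]  # replace handle with nonterminal
                    reduced = True
                    break
        if buf:
            stack.append(buf.pop(0))         # shift the next input symbol
        else:
            return stack == ["S"]            # accept only a full sentence

print(parse(["N", "V", "N"]))         # "I eat cookies" -> True
print(parse(["Det", "N", "V", "N"]))  # "The man eats cookies" -> True
```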


Syntactic ambiguity is one such issue which can throw off static parsers. This occurs when a sentence can feasibly be interpreted in different ways. For example, consider the following two parse trees of the sentence "I saw the man with the telescope."

Figure 4: Two viable, alternative parse trees for the sentence

In the first tree, it is clear that I was looking through a telescope when I saw the man. The second parse tree shows that I saw a man who was holding a telescope. It is clear how these alternative interpretations affect the noun phrase parsing of the



sentence. In the first example, the noun phrase consists only of "the man", whereas in the second example, the phrase includes "with the telescope". Without more information about the context of this sentence, there would be no way to conclude which parsing is correct. A non-adaptable parser would simply pick one of these parse trees or the other every time it came across this particular configuration of parts of speech. Syntactic ambiguity like this occurs so often that it becomes a major failing point for static parsers.

3.2 Machine Learning Techniques

Machine learning methods aim to correct many of the shortcomings of the simple extractors. They have the ability to discover those unspoken rules of language. The more examples a learner sees, the more likely it is to discover new relationships. With this information, the learner is able to formulate and update rules. These techniques are based primarily on probability and statistics. Depending upon how well a given rule performs on a particular data set, certain weights are assigned. Learners can even address problems like structural ambiguity by discovering the context of a sentence. Given various POS configurations, a machine learner can formulate statistics about what is the most likely noun phrase parse tree.

Training is an integral part of any machine learner. In natural language processing, we often rely on a large, representative English language corpus on which to train.

This corpus contains a large number of English words as well as many different usages of those words. Sentences are tagged by hand with both parts of speech and noun phrase occurrences. In the computational linguistic sense, a corpus is a large, structured set of texts which is meant to be a representative sample of a particular genre or linguistic pattern (Sinclair). Often for the NPE task, it is most advantageous to train on an extremely general corpus, that is, one that is representative of the English language as a whole. Consider a corpus consisting only of Shakespeare's collected writings, for example. The sentence structures used by Shakespeare are rather unique; a learner trained on this corpus might perform very well when parsing Shakespeare, or perhaps even various other sixteenth-century poets. However, it might perform very poorly if applied to MySpace blogs, medical texts, or law briefs. This is because each of these genres contains large numbers of particular micro-linguistic features such as proper nouns or passive verb phrases (Sinclair). Therefore, a good English language corpus attempts to encompass all of the linguistic variations in appropriate proportions. Nearly all researchers of NPE recognize the Penn Treebank as the most appropriate and complete corpus for their purpose.

The Penn Treebank is an annotated corpus containing over 4.5 million English language words (Marcus, Santorini and Marcinkiewicz). It contains large selections from The Wall Street Journal. Noun phrases, prepositional phrases, and many other

grammatical patterns are tagged in the Penn Treebank. This enables the machine learners to be trained on known data. Because each of the algorithms discussed in this paper has been trained on the Penn Treebank, they are all examples of supervised learners. This means simply that they each have access to correct data, and learn by mapping correct inputs to desired outputs. An unsupervised learner would not have access to correct data, but instead would be forced to learn based on some kind of reward system. Unsupervised learners might be especially useful to scientists looking to discover new information about a particular topic, but they are inefficient and unneeded when it comes to tasks like grammar classification. Furthermore, these algorithms are all inductive, because they induce certain rules from correct examples. The following is a small excerpt from the Penn Treebank:

Figure 5: A small sample from the Penn Treebank

Once a learner has been appropriately trained, it may then be applied to an untagged version of a corpus selection, and then checked against that same, tagged portion for accuracy. "Untagged" in this case refers to text which has been tagged with POS, but not bracketed to indicate noun phrases. It is important that the learner be



tested on a corpus or selection that has not been part of the training set at any point. Otherwise, rules gleaned from that selection would be certain to ensure high accuracy on the test set. Though it is not a foregone conclusion that a learner trained on a particular portion of a corpus would have one hundred percent accuracy and recall on that same portion, it is still good scientific measure to test on other data sets, to negate the possibility.

3.2.1 Transformation-Based Learners

One of the first machine learning techniques applied to the NPE task was the transformation-based technique pioneered by Lance Ramshaw and Mitch Marcus in 1995 (Ramshaw and Marcus). Though this technique was originally introduced by Eric Brill for the POS tagging task (Brill), it has been a persistent and effective means of NPE parsing. Transformation-based learning is an error-driven approach. This means that it constantly generates new rules as it is being trained, and it then scores those rules individually. The algorithm which generates these is fairly simple:

Figure 6: Transformation-based learning algorithm


Though this algorithm may get fairly memory-intensive during the training period, it then produces a number of static rules which it will match to given POS sequences. Consider this algorithm applied to the following example:

Figure 7: Transformation-based learning example

Each of the first three rules shown in Figure 7 would most likely score quite highly. The last rule, however, is completely wrong. Nouns can easily come after verbs without being part of a noun phrase; they may instead be part of a verb phrase, for example. This rule will likely score poorly for this reason, and therefore it will be eliminated from the rule set. Notice also that the rules can encompass both words before and after the word in question. In this way, machine learners can handle context. The transformation-based parser by Ramshaw and Marcus can look up to two tags before and two tags after the current word in order to formulate rules.
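What follows is a minimal sketch of this error-driven scoring loop. The mini-corpus and the crude "POS tag to chunk label" rule template are hypothetical and far simpler than Ramshaw and Marcus's actual templates, which also condition on surrounding tags.

```python
# Hypothetical mini-corpus of (POS tag, gold chunk label) pairs, B/I/O labels.
train = [("DET", "B"), ("ADJ", "I"), ("NN", "I"),
         ("V", "O"), ("NN", "B"), ("DET", "B"), ("NN", "I")]

current = ["O"] * len(train)  # baseline guess: label every token Outside

def score(rule):
    """Net improvement if rule (POS -> new label) were applied everywhere."""
    pos, new = rule
    fixed = broken = 0
    for (p, gold), cur in zip(train, current):
        if p == pos:
            if new == gold and cur != gold:
                fixed += 1      # the rule corrects a current error
            elif cur == gold and new != gold:
                broken += 1     # the rule damages a correct label
    return fixed - broken

candidates = [(p, l) for p in ("DET", "ADJ", "NN", "V") for l in "BIO"]
while True:
    best = max(candidates, key=score)
    gain = score(best)
    if gain <= 0:
        break                   # no rule improves accuracy; stop training
    print("learned rule:", best, "gain:", gain)
    current = [best[1] if p == best[0] else c
               for (p, _), c in zip(train, current)]
```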


3.2.2 Memory-Based Learners

Memory-based learners classify data according to similarities to other data observed earlier. This is similar in some ways to how a human or an animal might make decisions about an unknown quantity. For example, if I were to observe a glowing red coil on a stovetop without ever having seen one before, I might extrapolate that it is hot based on its visual similarity to other hot things I have observed in the past, such as the glowing embers in a fire. A memory-based learner does this to an extreme degree; that is, it remembers everything that it has ever encountered. This property, termed "lazy learning", is unique to memory-based learners, as most other techniques remember only a small subset of the configurations they might encounter. Machine learning techniques which abstract knowledge from the training data and forget everything else are known as "greedy" learners (Daelemans, Buchholz and Veenstra).


Figure 8: Memory-based learning algorithm

The concept of determining distance between varying grammatical configurations can be difficult. Though there are several complex techniques for accomplishing this, the method used by the NPE parsers explored in this paper is an overlap function; this function simply calculates the number of overlapping features that an instance has with another instance occurring in memory. For example, consider the following candidate sequence:

the/DET beautiful/ADJ, talented/ADJ college/NN student/NN

The memory-based learner looks to its rule set and determines that the following three sequences have the highest number of overlapping features, and therefore they are added to the set of nearest neighbors:


| Sequence | Classification |
| --- | --- |
| DET ADJ NN NN | NP |
| DET ADJ ADJ NN NNP | NP |
| V DET ADJ NN NN | Not an NP |

Table 1: Nearest Neighbor List
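A minimal sketch of this overlap count and nearest-neighbor vote, with hypothetical memory contents mirroring Table 1:

```python
from collections import Counter

# Everything the lazy learner has seen so far, with its classification.
memory = [(("DET", "ADJ", "NN", "NN"), "NP"),
          (("DET", "ADJ", "ADJ", "NN", "NNP"), "NP"),
          (("V", "DET", "ADJ", "NN", "NN"), "not NP")]

def overlap(a, b):
    # Count positions where the two tag sequences agree.
    return sum(x == y for x, y in zip(a, b))

def classify(candidate, k=3):
    neighbors = sorted(memory, key=lambda m: overlap(candidate, m[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # majority vote of the k neighbors

print(classify(("DET", "ADJ", "ADJ", "NN", "NN")))  # -> 'NP'
```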


The majority of the nearest neighbors are classified as noun phrases; therefore the new configuration (DET ADJ ADJ NN NN) is classified as a noun phrase also, and added into memory.

3.2.3 Maximum Entropy Learners

Maximum entropy learners use a statistical method to predict which possibility is the most likely, that is, whether the current word begins, falls within, or falls outside a noun phrase. They work on the assumption that the probability distribution which maximizes information entropy, the measure of uncertainty associated with a random variable, is also the least biased. Consider that we have m unique propositions. The most informative distribution is one in which we know one of the propositions is true; information entropy is 0. The least informative distribution is one in which there is no reason to favor any one proposition over another; information entropy is log m (Berger, Della Pietra and Della Pietra). To better understand this principle, take into account the following example:

The tag configuration (DET ADJ NN) begins a noun phrase some of the time (B), falls in the middle of an existing noun phrase some of the time (I), and falls outside a noun phrase the remaining time (O).

P(B) + P(I) + P(O) = 1

To begin with, the learner assumes that each of these probabilities is evenly distributed:

P(B) = P(I) = P(O) = 0.33

Now suppose the learner discovers that this phrase either begins or ends a noun phrase half of the time:

P(B) = P(O) = 0.25, P(I) = 0.5

With each new learned constraint, the weights are redistributed to create the flattest possible probability distribution. This technique is particularly simple to implement, given that there are general tools that have been created to apply this algorithm, such as Maccent (Koeling).
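A minimal sketch of this redistribution, using scipy to maximize entropy under the learned constraint; the toy numbers match the example above, and the numerical optimizer is an assumed stand-in for the closed-form reweighting a real maximum entropy toolkit performs.

```python
import numpy as np
from scipy.optimize import minimize

labels = ["B", "I", "O"]   # begins / inside / outside a noun phrase

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)       # guard against log(0)
    return np.sum(p * np.log(p))     # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},        # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},    # learned: P(B)+P(O)=0.5
]

result = minimize(neg_entropy, x0=np.full(3, 1 / 3),
                  bounds=[(0, 1)] * 3, constraints=constraints)
print(dict(zip(labels, result.x.round(2))))  # {'B': 0.25, 'I': 0.5, 'O': 0.25}
```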


3.2.4 Hidden Markov Learners

These learners are based on the Hidden Markov Model (HMM). This statistical model can be applied as long as the system being modeled possesses the hidden Markov property. A system with the hidden Markov property must have a discrete number of possible states, and the probability distribution of future states must depend only on the present state and be completely independent of past states. These states are not directly observable. Consider the following Markov system (Wikipedia):

You have a friend who lives far away. You call him daily to hear what he did that day, which is one of three activities: walking, cleaning, or shopping. You know that his activities are linked to the weather, which could be in one of two states, rainy or sunny. Though you cannot directly observe the weather, you can attempt to guess what it may have been like, given the activities that your friend did that day. In this system, the weather possesses the hidden Markov property. This system possesses certain transition probabilities; for example, the likelihood that it will continue to rain if the current condition is Rainy is 70%, while the chance that it will suddenly become sunny is only 30%. You also know that there are certain emission probabilities; if it is sunny, you know that there is a 60% chance he will walk, a 30% chance he will shop, and only a 10% chance that he will clean. This system is fully described below.


states = {Rainy, Sunny}

observations = {walk, shop, clean}

start_probability = {Rainy: 0.6, Sunny: 0.4}

transition_probability = {
    Rainy: {Rainy: 0.7, Sunny: 0.3},
    Sunny: {Rainy: 0.4, Sunny: 0.6}}

emission_probability = {
    Rainy: {walk: 0.1, shop: 0.4, clean: 0.5},
    Sunny: {walk: 0.6, shop: 0.3, clean: 0.1}}

Figure 9: Hidden Markov Model Probabilistic Parameters
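Given these parameters, the most likely hidden weather sequence for a run of observed activities can be recovered with the Viterbi algorithm. A compact sketch, with values mirroring Figure 9:

```python
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((p * trans_p[prev][s] * emit_p[s][o], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

prob, path = viterbi(["walk", "shop", "clean"])
print(path, round(prob, 4))  # ['Sunny', 'Rainy', 'Rainy'] 0.0134
```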

In the case of noun phrase extraction, the hidden property is the unknown grammar rule. Observations are formed by the training data. Contextual probabilities represent the transition states; that is, given our previous two transitions, what is the likelihood of continuing, ending, or beginning a noun phrase: P(o_i | o_{i-1}, o_{i-2}).



Emission or output probabilities are the chances that the current word will begin, continue, or end a noun phrase given the current state.

3.2.5 Conditional Random Field Learners

Conditional Random Field (CRF) machine learners are actually quite similar to hidden Markov learners. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs. The transition probabilities of the HMM have been transformed into feature functions that are conditional upon the input sequence. In the following example, each Y_i could be B, I, or O:

Figure 10: CRF Probabilistic Parameters

In this undirected graphical model, each vertex represents a random variable whose distribution is to be inferred, and each edge represents a dependency between two random variables. In a CRF, the distribution of each discrete random variable Y in the graph is conditioned on an input sequence X.
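As a sketch of how such a model can be fit in practice, the example below uses the third-party sklearn-crfsuite package; this is an assumed stand-in, not the toolkit used by Sha and Pereira, and the tiny training sequence is hypothetical. The feature dictionaries condition each label on the surrounding tag window, which is the conditional behavior described above.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def features(tags, i):
    # Describe token i by its own POS tag and its neighbors', so the label
    # distribution is conditioned on the observed input sequence X.
    f = {"pos": tags[i]}
    if i > 0:
        f["prev_pos"] = tags[i - 1]
    if i < len(tags) - 1:
        f["next_pos"] = tags[i + 1]
    return f

sentences = [["DET", "ADJ", "NN", "V", "NN"]]
X = [[features(s, i) for i in range(len(s))] for s in sentences]
y = [["B", "I", "I", "O", "B"]]  # gold B/I/O chunk labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # [['B', 'I', 'I', 'O', 'B']]
```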

3.2.6 Support Vector Machine Learners

Support vector machines (SVMs) are the most complex machine learning technique explored in this paper; however, they are also the most accurate and computationally inexpensive. SVMs are binary classifiers. This means that they are best used to separate one class of items from another. In the NPE case, they can be used to separate noun phrases from non-noun phrases. This is done by representing each of the possible grammatical configurations, such as (NN P V), in the set as an ordered pair, or ordered n-tuple. This configuration is then mapped into an n-dimensional space, where n is the number of features that the configurations should be compared on (such as length or amount of overlap). As it is mapped, it is also assigned a color, such as red or blue, which represents whether or not the configuration is a noun phrase. Once this mapping has been done for all items in the training set, the system attempts to separate the red items from the blue ones using an (n-1)-dimensional hyperplane. Simply put, this means that if the system graphs into two-dimensional space, the features are separated by a line; if the system graphs into a three-dimensional space, the features are separated by a plane. Figure 11 illustrates support vectors in 2-dimensional space. It should be noted that


an SVM will always attempt to maximize the margin between the two classes of points to ensure the most generalization.

Figure 11: Support Vectors in 2-dimensional space (Kudo and Matsumoto, "Chunking with Support Vector Machines")

Once a graph of the feature space is known, it can then be applied to new data. As new phrases are read into the SVM, they are plotted accordingly. The phrases must fall on one side of the line or the other, thereby classifying them either as noun phrases or not. Though an SVM can handle literally millions of configurations, it is still one of the least computationally expensive techniques. This is because each configuration is simply recognized as an ordered pair. There is no lookup which takes place during the parsing of the text, either. Each phrase is simply mapped into the space and a yes-or-no answer is immediately delivered.
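A minimal sketch of this mapping-and-separation idea using scikit-learn's SVC; the tag-window features and toy labels are hypothetical, not Kudo and Matsumoto's actual setup.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# Hypothetical training configurations: a window of POS tags, labeled
# 1 if the window is a noun phrase and 0 if it is not.
train = [({"t0": "DET", "t1": "ADJ", "t2": "NN"}, 1),
         ({"t0": "DET", "t1": "NN", "t2": "NN"}, 1),
         ({"t0": "V", "t1": "DET", "t2": "NN"}, 0),
         ({"t0": "V", "t1": "ADV", "t2": "V"}, 0)]

vec = DictVectorizer()                        # maps configurations to points
X = vec.fit_transform([f for f, _ in train])  # in n-dimensional feature space
y = [label for _, label in train]

clf = SVC(kernel="linear").fit(X, y)          # max-margin separating hyperplane

# Classifying a new phrase is a single mapping plus a side-of-the-plane test.
print(clf.predict(vec.transform([{"t0": "DET", "t1": "ADJ", "t2": "NN"}])))
```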


4 Analysis and Conclusions

Each of these techniques was trained and tested on a common data set for the CoNLL (Conference on Natural Language Learning) 2000 shared task. This data set is a selection from the Penn Treebank. The performance metric used is F-measure. This measure tests both the precision and recall of a given technique. In this equation, β refers to the weight of precision to recall, which is always 1 for the experiments explored by this paper.

Figure 12: Definitions for Precision, Recall, and F-Measure
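For reference, the standard CoNLL-style definitions behind Figure 12 are as follows; the figure's exact notation may differ:

```latex
\text{precision} = \frac{|\text{correct NPs found}|}{|\text{NPs found}|},
\qquad
\text{recall} = \frac{|\text{correct NPs found}|}{|\text{NPs in the gold data}|},
\qquad
F_{\beta} = \frac{(\beta^{2}+1)\cdot\text{precision}\cdot\text{recall}}
                 {\beta^{2}\cdot\text{precision}+\text{recall}}
```

With β = 1, as in all experiments here, the last expression reduces to the harmonic mean 2·precision·recall / (precision + recall).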

4.1 Noun Phrase Extraction Accuracy Comparison

The following table contains a chronological listing of the foremost implementations of the techniques explored by this paper.

| Primary Work | Year | Parsing Method | Implementation Title | Language | Performance Measure |
| --- | --- | --- | --- | --- | --- |
| Marc Vilain, David Day (Vilain and Day) | 1995 | Static, Rule-based | Alembic | LISP | 77 |
| Lance Ramshaw, Mitch Marcus (Ramshaw and Marcus) | 1995 | Transformation-based | Marc Greenwood Chunker | Java | 92.03 |
| Rob Koeling (Koeling) | 2000 | Maximum Entropy | MaxEnt | Java | 91.97 |
| Taku Kudoh, Yuji Matsumoto (Kudo and Matsumoto, "Chunking with Support Vector Machines") | 2001 | Support Vector Machine | YamCha | Perl, C++, Python | 94.39 |
| Erik Tjong Kim Sang (Tjong Kim Sang) | 2002 | Memory-based | TiMBL | Python | 92.5 |
| Antonio Molina, Ferran Pla (Molina and Pla) | 2002 | Hidden Markov Model | N/A | N/A | 92.19 |
| Fei Sha, Fernando Pereira (Sha and Pereira) | 2003 | Conditional Random Fields | CRF++² | C++ | 94.38 |

Table 2: NPE Accuracy Comparison

² Note that this implementation was authored by Taku Kudoh and Yuji Matsumoto.


4.2 Advantages and Disadvantages of Techniques

| Parsing Technique | Advantages | Disadvantages |
| --- | --- | --- |
| Non-adaptable | Extremely simple to implement. Does not require a training corpus. | Rules are not sufficiently complete; they cannot predict the subtleties and exceptions that occur naturally in language. Very difficult to generate new rules without the skills of an accomplished linguist. |
| Transformation-based | The first machine learning technique to be applied; therefore, there are several well-done implementations. | Memory intensive; requires large amounts of storage for rule sequences. Computationally complex; constantly probing the database for appropriate rules to apply. |
| Memory-based | Well suited to the NLP task; achieves high levels of accuracy. Relatively simple technique to implement. | Memory intensive; stores every grammatical configuration that it encounters. Fails to recognize feature dependencies and therefore stores many more rules than are necessary. No ability to weight important features; indiscriminately stores and applies all configurations; may prefer an anomalous configuration to an established one. |
| Maximum Entropy | Very easy to adapt and implement; because it is a very generalized statistical technique, there are several established tools outside of NLP which can be adapted. | Does not take into account the context of the phrase; examines only the current word and makes a local decision. |
| Hidden Markov Model | Takes position in the sentence into account; recognizes where in a sentence noun phrases are more likely to occur. | Makes independence assumptions which cause it to ignore special input features such as suffixes, capitalization, and context. |
| Conditional Random Fields | Improved version of an HMM; CRFs do take context into account and therefore achieve a much higher accuracy. | Can overfit the training data (Smith); the model can become too specific to the training data and take performance hits (however, this effect can be mitigated by carefully controlling the individual weighting factors). |
| Support Vector Machines | Achieves the highest level of accuracy. Very quick runtimes, because it takes very few operations to plot a configuration in the solution space; then the answer is immediately known. | Can be a very complex system to implement. |

4.3 Data Abstraction Comparison

Another way to evaluate each of the machine learning techniques is in terms of their data abstraction. A technique which does not exhibit data abstraction is extremely detail-driven; it overfits the test data set, creating rules which might perform very well on similar sets, but poorly on a more general set. Consider training a tool like this on a medical corpus. It may become very good at identifying the exact kinds of sentence structures that occur in this type of text, but it would then perform very badly if applied to The Wall Street Journal, for example. A tool with significant data abstraction, however, may be able to extrapolate more general rules which may perform better under this kind of circumstance. For this reason, this seems to be an indicator of how much the machine learner has really learned; the tool is able to make more educated guesses, rather than just regurgitating past instances that it

has encountered. I think that Daelemans et al. said it well when they termed their own memory-based parser a "lazy learner". The techniques which are less abstract will ultimately be outperformed on unknown data sets or data sets of a differing genre. Ultimately, the more abstract the technique is, the more robust it is; that is, the more capable it is of handling errors. The following illustration represents my approximation:


Figure 13: Data Abstraction Diagram

Appendix I contains a complete classification of all noun phrase extractors I could find, to date. A glossary of all abbreviations used in this paper can be found in Appendix II.


Works Cited
Abney, Steven. "Parsing by Chunks." Berwick, Robert, Steven Abney and Carol Tenny. Principle-Based Parsing. Dordrecht: Kluwer Academic Publishers, 1991. 1-18.

—. "Partial Parsing via Finite-State Cascades." Workshop on Robust Parsing, 8th European Summer School in Logic, Language and Information. Prague, 1996. 8-15.

ACL SIGLEX. ACL SIGLEX Resource Links. 13 March 2008. Special Interest Group on the Lexicon of the Association for Computational Linguistics. 20 March 2008 <http://www.clres.com/corp.html>.

Argamon, Shlomo, Ido Dagan and Yuval Krymolowski. "A Memory-Based Approach to Learning Shallow Natural Language Patterns." 17th International Conference on Computational Linguistics. Montreal, Quebec: Association for Computational Linguistics, 1998. 67-73.

Berger, Adam, Vincent Della Pietra and Stephen Della Pietra. "A Maximum Entropy Approach to Natural Language Processing." Computational Linguistics March 1996: 39-71.

Bourigault, Didier. "Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases." 14th Conference on Computational Linguistics. Nantes, France: Association for Computational Linguistics, 1992. 977-981.

Brill, Eric. "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging." Computational Linguistics 1995: 543-565.


Buckley, Chris, et al. "The Smart/Empire TIPSTER IR System." Association for Computational Linguistics Workshop. Baltimore, Maryland: Association for Computational Linguistics, 1998. 107-121.

Cardie, Claire and David Pierce. "Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification." 17th International Conference on Computational Linguistics. Montreal, Quebec, Canada: Association for Computational Linguistics, 1998. 218-224.

Chen, Kuang-hua and Hsin-Hsi Chen. "Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and its Automatic Evaluation." 32nd Annual Meeting on Association for Computational Linguistics. Las Cruces, New Mexico: Association for Computational Linguistics, 1994. 234-241.

Cheng, D., et al. "A Divide-and-Merge Methodology for Clustering." 24th ACM International Conference on Principles of Database Systems. 2005. 196-204.

Church, Kenneth. "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text." Second Conference on Applied Natural Language Processing. Austin, Texas: Association for Computational Linguistics, 1988. 136-143.

Daelemans, Walter, Sabine Buchholz and Jorn Veenstra. "Memory-Based Shallow Parsing." Conference on Natural Language Learning, 1999.

Dejean, Herve. "Learning Syntactic Structures with XML." 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Lisbon, Portugal: Association for Computational Linguistics, 2000. 133-135.

Dy, Jennifer and Carla Brodley. "Feature Selection for Unsupervised Learning." The Journal of Machine Learning Research 5 (2004): 845-889.

Hindle, Donald and Mats Rooth. "Structural Ambiguity and Lexical Relations." Computational Linguistics: Special Issue on Using Large Corpora March 1993: 103-120.

Hindle, Donald. "Noun Classification from Predicate-Argument Structures." 28th Annual Meeting on Association for Computational Linguistics. Pittsburgh, Pennsylvania: Association for Computational Linguistics, 1990. 268-275.

Joho, Hideo and Mark Sanderson. "Retrieving Descriptive Phrases from Large Amounts of Free Text." Ninth International Conference on Information and Knowledge Management. McLean, Virginia: ACM Press, 2000. 180-186.

Koeling, Rob. "Chunking with Maximum Entropy Models." 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Lisbon, Portugal: Association for Computational Linguistics, 2000. 139-141.

Kudo, Taku and Yuji Matsumoto. "Chunking with Support Vector Machines." 2nd Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Pittsburgh, Pennsylvania: Association for Computational Linguistics, 2001. 1-8.

—. "Use of Support Vector Learning for Chunk Identification." 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. Lisbon, Portugal: Association for Computational Linguistics, 2000. 142-144.

Kumaran, Giridhar and James Allan. "Using Names and Topics for New Event Detection." Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Vancouver, British Columbia, Canada: Association for Computational Linguistics, 2005. 121-128.

Liu, Huan and Hiroshi Motoda. Feature Selection for Knowledge Discovery and Data Mining. Norwell, MA: Kluwer Academic Publishers, 1998.

Marcus, Mitchell P., Beatrice Santorini and Mary Ann Marcinkiewicz. "Building a Large Annotated Corpus of English: The Penn Treebank." Computational Linguistics. 1993. 313-330.

Meng Soon, Wee, Hwee Tou Ng and Daniel Chung Yong Lim. "A Machine Learning Approach to Coreference Resolution of Noun Phrases." Computational Linguistics: Special Issue on Computational Anaphora Resolution December 2001: 521-544.

Molina, Antonio and Ferran Pla. "Shallow Parsing Using Specialized HMMs." Journal of Machine Learning Research (2002): 595-613.

Ramshaw, Lance and Mitch Marcus. "Text Chunking Using Transformation-Based Learning." Third Workshop on Very Large Corpora. Ed. David Yarovsky and Kenneth Church. Somerset, New Jersey: Association for Computational Linguistics, 1995. 82-94.

Sha, Fei and Fernando Pereira. "Shallow Parsing with Conditional Random Fields." Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Edmonton, Canada: Association for Computational Linguistics, 2003. 134-141.

Sinclair, John. Developing Linguistic Corpora: a Guide to Good Practice. 2004. Arts and Humanities Data Service. 20 March 2008 <http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter1.htm>.

Smith, Andrew. Logarithmic Opinion Pools for Conditional Random Fields. 26 June 2007. The University of Edinburgh. 23 March 2008 <http://www.era.lib.ed.ac.uk/handle/1842/1730>.

Taylor, Lita, Claire Grover and Ted Briscoe. "The Syntactic Regularity of English Noun Phrases." Fourth Conference on European Chapter of the Association for Computational Linguistics. Manchester, England: Association for Computational Linguistics, 1989. 256-263.

Tjong Kim Sang, Erik. "Memory-Based Shallow Parsing." The Journal of Machine Learning Research 2 (2002): 559-594.

University of Pennsylvania. The Penn Treebank Project. 2 February 1999. Penn Engineering. 21 March 2008 <http://www.cis.upenn.edu/~treebank/>.

Vilain, Marc and David Day. "Phrase Parsing with Rule Sequence Processors: an Application to the Shared CoNLL Task." Proceedings of the Fourth Conference on Computational Natural Language Learning and of the Second Learning Language in Logic Workshop. Lisbon: Association for Computational Linguistics, 2000.

Wikipedia. Hidden Markov Model. 12 March 2008. 22 March 2008 <http://en.wikipedia.org/wiki/Hidden_Markov_model>.

Appendix I
Classification of Current Implementations

Appendix II
Glossary of Abbreviations¹

| Abbreviation | Meaning |
| --- | --- |
| NPE | Noun Phrase Extraction |
| NLP | Natural Language Processing |
| DET/d | Determiner |
| ADJ/a | Adjective |
| NN/n | Noun |
| RC/rc | Relative clause |
| PP/pp | Prepositional phrase |
| V | Verb |
| VP | Verb Phrase |
| S | Sentence |
| POS | Part Of Speech |
| FSA | Finite State Automaton |
| NP | Noun Phrase |
| LR | Left-Right (scans input from Left to right, guided by Rightmost derivations) |
| B | Beginning a noun phrase |
| I | Inside of a noun phrase |
| O | Outside of a noun phrase |
| HMM | Hidden Markov Model |
| CRF | Conditional Random Field |
| SVM | Support Vector Machine |

¹ Entries in this glossary are made in the order in which they appear.
