This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con
SCF            # of SCFs  Frequency cut-off               Confidence cut-off 0.03         Centroid cut-off 0.03
                          Precision       Recall          Precision       Recall          Precision       Recall
Tnx0Vnx1       267(222)   0.959(212/221)  0.794(212/267)  0.958(253/264)  0.948(253/267)  0.956(260/272)  0.974(260/267)
Tnx0Vs1        38(29)     0.357(10/28)    0.263(10/38)    0.381(8/21)     0.211(8/38)     0.323(10/31)    0.263(10/38)
Tnx0Vnx2nx1    21(16)     0.105(6/57)     0.286(6/21)     0.185(10/54)    0.476(10/21)    0.122(9/74)     0.429(9/21)
Tnx0Vnx1Pnx2   8(4)       0.200(3/15)     0.375(3/8)      0.200(2/10)     0.250(2/8)      0.250(2/8)      0.250(2/8)
Tnx0Vnx1pnx2   5(1)       0.024(1/41)     0.200(1/5)      0.029(1/34)     0.200(1/5)      na(0/0)         0.000(0/5)
Tnx0Vplnx1     40(23)     0.538(7/13)     0.175(7/40)     0.667(6/9)      0.150(6/40)     0.778(7/9)      0.175(7/40)
Tnx0Vpl        20(0)      na(0/0)         0.000(0/20)     na(0/0)         0.000(0/20)     na(0/0)         0.000(0/20)
Tnx0Vnx1s2     11(6)      0.083(1/12)     0.091(1/11)     0.200(1/5)      0.091(1/11)     0.200(1/5)      0.091(1/11)
Ts0Vnx1        8(1)       0.000(0/2)      0.000(0/8)      na(0/0)         0.000(0/8)      na(0/0)         0.000(0/8)
Tnx0Vax1       2(1)       0.000(0/9)      0.000(0/2)      0.000(0/3)      0.000(0/2)      0.000(0/1)      0.000(0/2)
Tnx0Vplnx2nx1  1(0)       0.000(0/2)      0.000(0/1)      na(0/0)         0.000(0/1)      na(0/0)         0.000(0/1)
Table 2: Precision and recall for 400 SCFs obtained from frequency cut-off, confidence cut-off 0.03, and centroid cut-off 0.03
lored lexicon. The centroid cut-off using the lexicon boosted precision and recall compared to the confidence cut-off and the centroid cut-off without the lexicon.
We finally investigate the precision and recall of the resulting SCFs for every SCF type in order to evaluate the effects of our method on each SCF. Table 2 shows the precision and recall of the SCFs obtained with the frequency cut-off (relative-frequency threshold 0.092), the confidence cut-off 0.03 (confidence-value threshold 0.953), and the centroid cut-off 0.03 (confidence-value threshold 0.889),⁷ where the thresholds for the relative frequency and the confidence value were chosen to preserve exactly 400 SCFs. The numbers in parentheses in the "# of SCFs" column show the number of SCFs in the test SCF lexicon that are acquired from the training corpus. The left and right numbers in parentheses in the precision columns show the number of correct SCFs against all SCFs in the resulting SCF lexicon, while those in the recall columns show the number of correct SCFs against all SCFs in the test SCF lexicon.
⁷ Since no word takes SCF Tnx0Vpnx1 in the test SCF lexicon, we omit it here.
We can observe a tendency for the confidence cut-off and the centroid cut-off to preserve more transitive (Tnx0Vnx1) SCFs. This is because some SCFs of Tnx0Vnx1 in the test SCF lexicon are not observed in the training corpus but are predicted by the a priori distribution for SCF Tnx0Vnx1. Also, the centroid cut-off tends to reduce implausible SCFs of Tnx0Vnx1Pnx2 and Tnx0Vax1. Since the threshold for the confidence value of the centroid cut-off 0.03 (0.889) is smaller than that of the confidence cut-off 0.03 (0.953), the clustering could eliminate implausible SCFs without reducing recall.
In short, one reason why the centroid cut-off outperforms the confidence cut-off (or the frequency cut-off) is the way the centroid cut-off eliminates SCFs that do not exist in the lexicon. When we eliminate SCFs with lower relative frequency under the assumption that those SCFs tend to be wrongly acquired, we inevitably also eliminate correct SCFs with low relative frequencies. By using the co-occurrence tendency among SCFs as another criterion to judge the implausibility of the SCFs, we can eliminate more wrongly acquired SCFs because they tend to violate the co-occurrence tendency. Another reason why the centroid cut-off and the confidence cut-off outperform the frequency cut-off is the way those cut-offs add new unseen SCFs. We can add plausible SCFs from among those SCFs that are reliable according to their a priori distribution. Furthermore, since the centroid cut-off makes use of the co-occurrence tendency among SCFs, it adds only SCFs that are plausible in terms of corpus-based statistics (confidence value) under the restriction provided by the co-occurrence tendency among SCFs in the lexicon of the target grammar.
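
The per-SCF evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `top_n_threshold` and `precision_recall_by_scf`, the toy words, and the toy scores are all assumptions made for illustration. The sketch picks the threshold that keeps a fixed number of (word, SCF) pairs, assuming no tied scores at the boundary, and then counts correct SCFs against the resulting lexicon (precision) and against the test lexicon (recall), reporting "na" as `None` when the resulting lexicon contains no SCF of that type.

```python
def top_n_threshold(scores, n):
    """Return the lowest score among the n best-scoring (word, SCF) pairs.

    Keeping every pair whose score is >= this value preserves exactly n
    pairs, assuming there are no ties at the boundary (an assumption).
    """
    ranked = sorted(scores.values(), reverse=True)
    return ranked[n - 1]


def precision_recall_by_scf(acquired, gold):
    """Compute per-SCF-type precision and recall.

    acquired, gold: sets of (word, scf) pairs (resulting and test lexicon).
    Returns {scf: (correct, n_acquired, n_gold, precision, recall)}, where
    precision or recall is None when its denominator is zero ("na").
    """
    stats = {}
    for scf in {s for _, s in acquired | gold}:
        acq = {pair for pair in acquired if pair[1] == scf}
        ref = {pair for pair in gold if pair[1] == scf}
        correct = len(acq & ref)
        precision = correct / len(acq) if acq else None
        recall = correct / len(ref) if ref else None
        stats[scf] = (correct, len(acq), len(ref), precision, recall)
    return stats


# Toy data (hypothetical words and scores, for illustration only).
scores = {
    ("give", "Tnx0Vnx1"): 0.99,
    ("give", "Tnx0Vnx2nx1"): 0.95,
    ("say", "Tnx0Vs1"): 0.90,
    ("sleep", "Tnx0Vnx1"): 0.40,
    ("say", "Tnx0Vax1"): 0.10,
}
gold = {
    ("give", "Tnx0Vnx1"),
    ("give", "Tnx0Vnx2nx1"),
    ("say", "Tnx0Vnx1"),
    ("say", "Tnx0Vs1"),
    ("look", "Tnx0Vpl"),
}

threshold = top_n_threshold(scores, 3)
acquired = {pair for pair, score in scores.items() if score >= threshold}
stats = precision_recall_by_scf(acquired, gold)
```

Applied to the first row of Table 2 for the frequency cut-off, the same ratios give 212/221 ≈ 0.959 for precision and 212/267 ≈ 0.794 for recall.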
5 Concluding Remarks and Future Work
In this paper, we presented a novel way to improve the quality of SCFs acquired from corpora in order to augment a lexicalized grammar with them. By applying our method to the acquired SCF lexicon using the XTAG English grammar, we showed that our method improved both precision and recall of the resulting SCFs compared to the naive frequency-based cut-off.
In future work, we are going to investigate the parsing performance of the XTAG English grammar augmented with SCFs obtained by our method. We will also apply our method to lexicalized grammars with relatively small lexicons, e.g., the LINGO English Resource Grammar (Flickinger, 2000).
Acknowledgment
The authors wish to thank Yoshimasa Tsuruoka and Takuya Matsuzaki for their advice on probabilistic modeling of the set of SCFs, and thank Alex Fang for his help in using SCFs acquired from the corpus. The authors are also indebted to Yusuke Miyao, John Carroll and the three anonymous reviewers for their valuable comments on this paper. The first author was supported in part by JSPS Research Fellowships for Young Scientists.