This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con
Table 1: Tree families of the XTAG English grammar mapped from 23 out of 163 SCF types

Tree family      Description
Tnx0Vnx1         Transitive
Tnx0Vs1          Sentential complement
Tnx0Vnx2nx1      Ditransitive
Tnx0Vnx1Pnx2     Multiple anchor ditransitive with PP
Tnx0Vnx1pnx2     Ditransitive with PP
Tnx0Vplnx1       Transitive verb particle
Tnx0Vpl          Intransitive verb particle
Tnx0Vnx1s2       Sentential complement with NP
Tnx0Vpnx1        Intransitive with PP
Ts0Vnx1          Transitive sentential subject
Tnx0Vax1         Intransitive with adjective
Tnx0Vplnx2nx1    Ditransitive verb particle
[Figure 4 plot. Axes: Precision (x) and Recall (y), each from 0 to 1. Legend: confidence cut-off 0.01, confidence cut-off 0.03, confidence cut-off 0.05.]
In order to evaluate our method, we split the SCF lexicon of the XTAG English grammar into a training portion and a test portion. The training portion includes 9,427 SCFs for 8,399 words, while the test portion includes 433 SCFs for 280 words. The test portion is selected from the SCF lexicon for words that are observed in the acquired SCF lexicon. We extract SCF confidence-value vectors from the training portion and combine them with the SCF confidence-value vectors obtained from the acquired SCFs. The number of the resulting data objects is 8,679.⁵ We also use the SCF confidence-value vectors obtained from the training SCF lexicon as initial centroids by regarding ε as 0. The total number of them is 35.⁶ We then cluster these 8,679 data objects into 35 clusters.
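The selection of initial centroids from the training SCF lexicon can be illustrated with a small sketch. It assumes that each word's vector is binary once ε is regarded as 0, that the lexicon is available as a mapping from words to sets of SCF names, and that a vector qualifies as a centroid when it appears with more than two words. The function name `initial_centroids` and the toy lexicon are hypothetical and are not taken from the paper.

```python
from collections import Counter

def initial_centroids(lexicon, scf_types, min_words=3):
    """Return the distinct binary SCF vectors shared by at least `min_words` words.

    lexicon   : dict mapping a word to the set of SCF names it takes (assumed layout)
    scf_types : ordered list of SCF names that fixes the vector dimensions
    min_words : 3 encodes "appears with more than two words"
    """
    counts = Counter(
        tuple(1 if scf in scfs else 0 for scf in scf_types)
        for scfs in lexicon.values()
    )
    # Counter keeps first-seen order, so the result is deterministic.
    return [vec for vec, n in counts.items() if n >= min_words]

# Toy data (hypothetical words and SCF assignments).
scf_types = ["Tnx0Vnx1", "Tnx0Vs1", "Tnx0Vpl"]
toy_lexicon = {
    "w1": {"Tnx0Vnx1"},
    "w2": {"Tnx0Vnx1"},
    "w3": {"Tnx0Vnx1"},
    "w4": {"Tnx0Vs1"},
    "w5": {"Tnx0Vs1"},
    "w6": {"Tnx0Vnx1", "Tnx0Vpl"},
    "w7": {"Tnx0Vnx1", "Tnx0Vpl"},
    "w8": {"Tnx0Vnx1", "Tnx0Vpl"},
}
centroids = initial_centroids(toy_lexicon, scf_types)
```

In this toy lexicon the vector for the words taking only Tnx0Vs1 is dropped, because it is shared by just two words.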
We finally evaluate precision and recall of the resulting SCFs by comparing them with the test SCF lexicon of the XTAG English grammar.
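As a concrete reading of this evaluation, precision and recall can be computed over (word, SCF) pairs. The following is a minimal sketch under that assumption; the function name `precision_recall`, the pair representation, and the example entries are hypothetical, and returning 0.0 for an empty set is a convention chosen here.

```python
def precision_recall(predicted, gold):
    """Compare predicted (word, SCF) pairs against gold-standard pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold pairs
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical entries, for illustration only.
predicted = {("give", "Tnx0Vnx2nx1"), ("give", "Tnx0Vnx1"), ("give", "Tnx0Vs1")}
gold = {("give", "Tnx0Vnx2nx1"), ("give", "Tnx0Vnx1"),
        ("say", "Tnx0Vs1"), ("say", "Tnx0Vnx1")}
p, r = precision_recall(predicted, gold)
```

Here two of the three predicted pairs are correct and two of the four gold pairs are found.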
We first compare confidence cut-off with frequency cut-off to investigate the effects of Bayesian estimation. Figure 4 shows precision and recall of the resulting SCF sets using confidence cut-off and frequency cut-off. We measured precision and recall of the SCF sets obtained using confidence cut-off with recognition threshold t = 0.01 (confidence cut-off 0.01), 0.03 (confidence cut-off 0.03), and 0.05 (confidence cut-off 0.05) by varying the threshold for the confidence value from 0 to 1. We also measured those for the SCF sets obtained using frequency cut-off by varying the threshold for the relative frequency from 0 to 1. The graph indicates that the confidence cut-offs outperformed the frequency cut-off. When we
5 We used the SCF confidence-value vectors for words which
Figure 4: Precision and recall of the resulting SCFs using confidence cut-off and frequency cut-off
[Figure 5 plot. Axes: Precision (x) and Recall (y), each from 0 to 1. Legend: centroid cut-off 0.03, centroid cut-off 0.03*.]
Figure 5: Precision and recall of the resulting SCFs using centroid cut-off and confidence cut-off
compare confidence cut-offs with different recognition thresholds, we can improve precision using a higher recognition threshold, while we can improve recall using a lower recognition threshold. This result is quite consistent with our expectations.
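The precision-recall curves discussed here come from sweeping a cut-off threshold from 0 to 1. A minimal sketch of such a sweep follows, assuming each (word, SCF) pair carries a single score (a confidence value or a relative frequency) and is kept when its score is at least the threshold. The function name `sweep_cutoff`, the `>=` comparison, and the scores are hypothetical.

```python
def sweep_cutoff(scores, gold, thresholds):
    """Return (threshold, precision, recall) for each cut-off threshold.

    scores     : dict mapping (word, SCF) pairs to a score in [0, 1]
    gold       : set of correct (word, SCF) pairs
    thresholds : iterable of cut-off values to try
    """
    curve = []
    for t in thresholds:
        kept = {pair for pair, score in scores.items() if score >= t}
        correct = len(kept & gold)
        # An empty set is scored 0.0 here; this is a convention, not a result.
        precision = correct / len(kept) if kept else 0.0
        recall = correct / len(gold) if gold else 0.0
        curve.append((t, precision, recall))
    return curve

# Hypothetical scores, for illustration only.
scores = {("a", "X"): 0.9, ("a", "Y"): 0.4, ("b", "X"): 0.1}
gold = {("a", "X"), ("b", "X")}
curve = sweep_cutoff(scores, gold, [0.0, 0.5, 1.0])
```

Raising the threshold in this toy example trades recall for precision, which is the general shape of the trade-off the curves trace out.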
We then compare centroid cut-off with confidence cut-off to observe the effects of clustering using information in the lexicon of the XTAG English grammar. Figure 5 shows precision and recall of the resulting SCF sets using centroid cut-off and confidence cut-off with the recognition threshold t = 0.03 by varying the threshold for the confidence value. In order to show the effects of the information in the training SCF lexicon, centroid cut-off 0.03* denotes the SCFs obtained by clustering only the SCF confidence-value vectors in the acquired SCFs, with random initialization. The graph shows that clustering is meaningful only when we make use of the reliable SCF confidence-value vectors obtained from the manually tai-
are included in the XTAG English grammar. When both the training SCF lexicon and the acquired SCF lexicon have the same words, we simply used an SCF confidence-value vector obtained from the acquired SCF lexicon.
6 We used the SCF confidence-value vectors that appear with more than two words.