This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con
Input:asetofSCFconfidence-value
vectorsV={v1,v2,...,vn} Rm
adistancefunctiond:Rm×Zm→Rafunctiontocomputeacentroidµ:{vj1,vj2,...,vjl}→Rm
Output:asetofclustersCj
whileclustermembersarenotstabledoforeachclusterCj
Cj={vi| cl,d(vi,cj)≤d(vi,cl)}endforeach
foreachclustersCjcj=µ(Cj)endforeachendwhilereturnCj
Aftereveryassignment,wedetermineanextcentroidcmofeachclusterCmasfollows:
1when∏vij>∏(1 vij)
vi∈Cmvi∈Cmcmj=(9)
0otherwiseWethenaddressthewaytodeterminethenumberof
clustersandinitialassignmentsofobjects.Inthispaper,weassumethatthemostofthepossiblesetofSCFsforwordsareincludedinthetargetlexicalizedgrammar,andmakeuseoftheexistingsetsofSCFsforthewordsinthelexiconofthetargetgrammartodeterminethepossiblesetofSCFsforwordsoutofthelexicon.We rstex-tractSCFcon dence-valuevectorsfromthelexiconofthetargetgrammarbyregardingε=0inEquation7.Byeliminatingduplicationsfromthem,weobtainSCFcentroid-valuevectorscm.Wetheninitializethenumberofclustersktothenumberofcmandusethemasinitialcentroids.4
We nallyupdatetheacquiredSCFsusingeachele-ment’svalueinthecentroidofeachclusterandthecon -dencevalueofSCFsinthisorder.We rsteliminateSCFsjforwiinaclustermwhenthevaluecmjofthecentroidcmis0,andaddSCFsjforwiinaclustermwhenthevaluecmjofthecentroidcmis1.Thisisbecausecmjrep-resentswhetherthewordsinthatclasscanhaveSCFsj.WetheneliminateimplausibleSCFssjforwifromtheresultingSCFsaccordingtoitscorrespondingcon dencevalueconfij.Wecallthiseliminationcentroidcut-off.Inthefollowingexperiments,wecomparethiscut-offwithnaivefrequencycut-off,whichusesonlyrelativefrequen-ciestoeliminateSCFsandcon dencecut-off,whichusesonlycon dencevaluestoeliminateSCFs.Notethatfre-quencycut-offandcon dencecut-offuseonlycorpus-basedstatisticstoeliminateSCFs.
Figure3:ClusteringalgorithmforSCFcon dence-valuedistributions
lexiconofthetargetgrammar,wealsorepresentSCFcon dence-valuevectorsforthewordsinthetargetgram-mars.Inthispaper,weexpressSCFcon dence-valuevectorsv iforwordsintheSCFlexiconofthetargetgram-marby:
1 εwihassjinthelexicon
vij=(7)
εotherwisewhereεexpressesanunreliabilityofthelexicon.Inthis
study,wesimplysetittothemachineepsilon.Inotherwords,wetrustthelexiconasmuchaspossible.3.2
ClusteringAlgorithmforSCFCon dence-ValueDistributions
Wenextpresentak-Means-likeclusteringalgorithmforSCFcon dence-valuevectors,asshowninFigure3.Givenaninitialassignmentofdataobjectstokclusters,ouralgorithmcomputesarepresentativevalueofeachclustercalledcentroids.Ouralgorithmtheniterativelyupdatesclustersbyassigningeachobjecttoitsclosestcentroidandrecomputingcentroidsuntilclustermembersbecomestable.
Althoughouralgorithmisroughlybasedonthek-Meansalgorithm,itisdifferentinanimportantrespect.Wede netheelementsofthecentroidvaluesoftheob-tainedclustersasadiscretevalueof0or1becausewewanttoobtainclusterswhichincludewordsthathavetheexactlysamesetofSCFs.Wethenderiveadistancefunc-tiondtocalculatethedistancefromadataobjectvitoeachcentroidcm.Sincethedistancefunctionisusedtodeterminetheclosestclusterforvi,wede nethefunc-tiondtooutputtheprobabilitythatvihastheSCFsetexpressedbycentroidcmasfollows:
d(vi,cm)=
cmj=1
4Experiments
∏
vij·
cmj=0
∏(1 vij).
(8)
WeappliedourmethodtoanSCFlexiconacquiredfrom135,,whichisthesamedatausedin(CarrollandFang,2004).Thenumberoftheresult-ingSCFsis14,783for3,864wordstems.Wethentrans-latedthemtoanSCFlexiconfortheXTAGEnglishgram-mar(XTAGResearchGroup,2001)byusingatranslationmapmanuallyde nedbyTedBriscoe.Itde nesamap-pingfrom23outof163possibleSCFtypesinto13outof57XTAGSCFscalledtreefamilieslistedinTable1.ThenumberofresultingSCFsfortheXTAGEnglishgram-marwas6,742for2,860wordstems.
alexiconofthegrammarisnotcomprehensiveor
lessaccurate,weshoulddeterminethenumberofclustersusingotheralgorithms(Bischofetal.,1999;Hamerly,2003).
4When
Byusingthisfunction,wecandeterminetheclosestclus-terasargmaxd(vi,cm).
Cm