手机版

Generalizing Subcategorization Frames Acquired from Corpora(4)

发布时间:2021-06-08   来源:未知    
字号:

This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con

Input:asetofSCFconfidence-value

vectorsV={v1,v2,...,vn} Rm

adistancefunctiond:Rm×Zm→Rafunctiontocomputeacentroidµ:{vj1,vj2,...,vjl}→Rm

Output:asetofclustersCj

whileclustermembersarenotstabledoforeachclusterCj

Cj={vi| cl,d(vi,cj)≤d(vi,cl)}endforeach

foreachclustersCjcj=µ(Cj)endforeachendwhilereturnCj

Aftereveryassignment,wedetermineanextcentroidcmofeachclusterCmasfollows:

1when∏vij>∏(1 vij)

vi∈Cmvi∈Cmcmj=(9)

0otherwiseWethenaddressthewaytodeterminethenumberof

clustersandinitialassignmentsofobjects.Inthispaper,weassumethatthemostofthepossiblesetofSCFsforwordsareincludedinthetargetlexicalizedgrammar,andmakeuseoftheexistingsetsofSCFsforthewordsinthelexiconofthetargetgrammartodeterminethepossiblesetofSCFsforwordsoutofthelexicon.We rstex-tractSCFcon dence-valuevectorsfromthelexiconofthetargetgrammarbyregardingε=0inEquation7.Byeliminatingduplicationsfromthem,weobtainSCFcentroid-valuevectorscm.Wetheninitializethenumberofclustersktothenumberofcmandusethemasinitialcentroids.4

We nallyupdatetheacquiredSCFsusingeachele-ment’svalueinthecentroidofeachclusterandthecon -dencevalueofSCFsinthisorder.We rsteliminateSCFsjforwiinaclustermwhenthevaluecmjofthecentroidcmis0,andaddSCFsjforwiinaclustermwhenthevaluecmjofthecentroidcmis1.Thisisbecausecmjrep-resentswhetherthewordsinthatclasscanhaveSCFsj.WetheneliminateimplausibleSCFssjforwifromtheresultingSCFsaccordingtoitscorrespondingcon dencevalueconfij.Wecallthiseliminationcentroidcut-off.Inthefollowingexperiments,wecomparethiscut-offwithnaivefrequencycut-off,whichusesonlyrelativefrequen-ciestoeliminateSCFsandcon dencecut-off,whichusesonlycon dencevaluestoeliminateSCFs.Notethatfre-quencycut-offandcon dencecut-offuseonlycorpus-basedstatisticstoeliminateSCFs.

Figure3:ClusteringalgorithmforSCFcon dence-valuedistributions

lexiconofthetargetgrammar,wealsorepresentSCFcon dence-valuevectorsforthewordsinthetargetgram-mars.Inthispaper,weexpressSCFcon dence-valuevectorsv iforwordsintheSCFlexiconofthetargetgram-marby:

1 εwihassjinthelexicon

vij=(7)

εotherwisewhereεexpressesanunreliabilityofthelexicon.Inthis

study,wesimplysetittothemachineepsilon.Inotherwords,wetrustthelexiconasmuchaspossible.3.2

ClusteringAlgorithmforSCFCon dence-ValueDistributions

Wenextpresentak-Means-likeclusteringalgorithmforSCFcon dence-valuevectors,asshowninFigure3.Givenaninitialassignmentofdataobjectstokclusters,ouralgorithmcomputesarepresentativevalueofeachclustercalledcentroids.Ouralgorithmtheniterativelyupdatesclustersbyassigningeachobjecttoitsclosestcentroidandrecomputingcentroidsuntilclustermembersbecomestable.

Althoughouralgorithmisroughlybasedonthek-Meansalgorithm,itisdifferentinanimportantrespect.Wede netheelementsofthecentroidvaluesoftheob-tainedclustersasadiscretevalueof0or1becausewewanttoobtainclusterswhichincludewordsthathavetheexactlysamesetofSCFs.Wethenderiveadistancefunc-tiondtocalculatethedistancefromadataobjectvitoeachcentroidcm.Sincethedistancefunctionisusedtodeterminetheclosestclusterforvi,wede nethefunc-tiondtooutputtheprobabilitythatvihastheSCFsetexpressedbycentroidcmasfollows:

d(vi,cm)=

cmj=1

4Experiments

vij·

cmj=0

∏(1 vij).

(8)

WeappliedourmethodtoanSCFlexiconacquiredfrom135,,whichisthesamedatausedin(CarrollandFang,2004).Thenumberoftheresult-ingSCFsis14,783for3,864wordstems.Wethentrans-latedthemtoanSCFlexiconfortheXTAGEnglishgram-mar(XTAGResearchGroup,2001)byusingatranslationmapmanuallyde nedbyTedBriscoe.Itde nesamap-pingfrom23outof163possibleSCFtypesinto13outof57XTAGSCFscalledtreefamilieslistedinTable1.ThenumberofresultingSCFsfortheXTAGEnglishgram-marwas6,742for2,860wordstems.

alexiconofthegrammarisnotcomprehensiveor

lessaccurate,weshoulddeterminethenumberofclustersusingotheralgorithms(Bischofetal.,1999;Hamerly,2003).

4When

Byusingthisfunction,wecandeterminetheclosestclus-terasargmaxd(vi,cm).

Cm

Generalizing Subcategorization Frames Acquired from Corpora(4).doc 将本文的Word文档下载到电脑,方便复制、编辑、收藏和打印
×
二维码
× 游客快捷下载通道(下载后可以自由复制和排版)
VIP包月下载
特价:29 元/月 原价:99元
低至 0.3 元/份 每月下载150
全站内容免费自由复制
VIP包月下载
特价:29 元/月 原价:99元
低至 0.3 元/份 每月下载150
全站内容免费自由复制
注:下载文档有可能出现无法下载或内容有问题,请联系客服协助您处理。
× 常见问题(客服时间:周一到周五 9:30-18:00)