
Generalizing Subcategorization Frames Acquired from Corpora(3)

Published: 2021-06-08   Source: unknown

This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF con

$s_j$ for $w_i$, which expresses how reliably a word $w_i$ has SCF $s_j$. We should note that the confidence value is not the probability that a word $w_i$ appears with SCF $s_j$ but a probability of existence of SCF $s_j$ for the word $w_i$. In this study, we assume that a word $w_i$ can have each SCF $s_j$ with a certain (non-zero) probability $\theta_{ij}$ ($= p(s_{ij} \mid w_i) > 0$, where $\sum_j \theta_{ij} = 1$), but only SCFs whose probabilities exceed a certain threshold are recognized as SCFs for the word in the lexicon. We hereafter call this threshold the recognition threshold. Figure 2 exemplifies a probability distribution of SCFs for apply. In this context, we can regard the confidence value of each SCF as the possibility that the probability of the SCF exceeds the recognition threshold.

One intuitive way to estimate a confidence value is to assume that an observed probability, i.e., relative frequency, is equal to the probability $\theta_{ij}$ of SCF $s_j$ for a word $w_i$ ($\theta_{ij} = freq_{ij} / \sum_j freq_{ij}$, where $freq_{ij}$ is the frequency count with which a word $w_i$ has the SCF $s_j$ in corpora¹). We simply assign 1 to the confidence value $conf_{ij}$ when the relative frequency of $s_j$ for a word $w_i$ exceeds the recognition threshold, and otherwise assign 0 to $conf_{ij}$. However, an observed probability is totally unreliable for infrequent words. For example, when we use a confidence value derived from a relative frequency as above, we cannot distinguish the case where a word $w_1$ appears once with an SCF $s_j$ from the case where a word $w_2$ appears 100 times, always with the SCF $s_j$, since both have a relative frequency of 1. Moreover, even when we would like to encode confidence values of reliable SCFs in the target lexicalized grammar, it is also problematic to distinguish the confidence values of those SCFs from the confidence values of acquired SCFs.
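The limitation described above can be illustrated with a minimal Python sketch. The function name, the SCF label `NP_V_NP`, the counts, and the threshold value are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_frequency_confidence(freq, threshold):
    """Binary confidence: 1 if the relative frequency of an SCF exceeds
    the recognition threshold, otherwise 0.

    freq maps each SCF label to its frequency count for one word.
    """
    total = sum(freq.values())
    return {scf: int(count / total > threshold) for scf, count in freq.items()}


# Hypothetical counts: w1 is seen once, w2 is seen 100 times,
# and both always occur with the same SCF.
w1_counts = {"NP_V_NP": 1}
w2_counts = {"NP_V_NP": 100}
t = 0.03  # illustrative recognition threshold (an assumption)

conf_w1 = relative_frequency_confidence(w1_counts, t)
conf_w2 = relative_frequency_confidence(w2_counts, t)
print(conf_w1, conf_w2)
```

Both words receive the same confidence value for the SCF, although the evidence for `w2` is a hundred times larger. This is the case that the relative-frequency estimate cannot distinguish.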

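The other promising way to estimate a true probability $\theta_{ij}$ is to regard it as a stochastic variable in the context of Bayesian statistics (Gelman et al., 1995). In this context, the a posteriori distribution of the probability $\theta_{ij}$ of an SCF $s_j$ for a word $w_i$ is given by:

$$
p(\theta_{ij} \mid D) = \frac{P(\theta_{ij})\,P(D \mid \theta_{ij})}{P(D)} = \frac{P(\theta_{ij})\,P(D \mid \theta_{ij})}{\int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij}}, \tag{1}
$$

where $P(\theta_{ij})$ is the a priori distribution, and $D$ is the data we have observed. Since the occurrences of SCFs in the data $D$ are independent of each other, the data $D$ can be regarded as Bernoulli trials in this case. When we observe the data $D$ that a word $w_i$ appears $n$ times and has SCF $s_j$ $x$ ($\le n$) times, its conditional distribution is therefore represented by the binomial distribution:

$$
P(D \mid \theta_{ij}) = \binom{n}{x}\,\theta_{ij}^{x}\,(1 - \theta_{ij})^{n - x}. \tag{2}
$$

To calculate this a posteriori distribution, we need to define the a priori distribution $P(\theta_{ij})$. The question is which probability distribution of $\theta_{ij}$ can appropriately reflect prior knowledge. In other words, it should encode the knowledge we use to estimate SCFs for an unknown word $w_i$. We simply determine it from distributions of probability values of $s_j$ for known words. We use distributions of observed probability values of $s_j$ for all words acquired from the corpus by using a method described in (Tsuruoka and Chikayama, 2001). In their study, they assume the a priori distribution to be the beta distribution defined as:

$$
p(\theta_{ij} \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,\theta_{ij}^{\alpha - 1}\,(1 - \theta_{ij})^{\beta - 1}, \tag{3}
$$

where $B(\alpha, \beta) = \int_0^1 \theta_{ij}^{\alpha - 1}\,(1 - \theta_{ij})^{\beta - 1}\,d\theta_{ij}$. The values of $\alpha$ and $\beta$ are determined by moment estimation.² By substituting Equations 2 and 3 into Equation 1, we finally obtain the a posteriori distribution $p(\theta_{ij} \mid D)$ as:

$$
p(\theta_{ij} \mid \alpha, \beta, D)
= \frac{\frac{1}{B(\alpha, \beta)}\,\theta_{ij}^{\alpha - 1}\,(1 - \theta_{ij})^{\beta - 1}\,\binom{n}{x}\,\theta_{ij}^{x}\,(1 - \theta_{ij})^{n - x}}{\int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij}}
= c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}, \tag{4}
$$

where $c = \binom{n}{x} \Big/ \left( B(\alpha, \beta) \int_0^1 P(\theta_{ij})\,P(D \mid \theta_{ij})\,d\theta_{ij} \right)$.

When we determine the value of the recognition threshold as $t$, we can calculate the confidence value $conf_{ij}$ that a word $w_i$ can have $s_j$ by integrating the a posteriori distribution $p(\theta_{ij} \mid D)$ from the threshold $t$ to 1:

$$
conf_{ij} = \int_t^1 c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}. \tag{5}
$$

By using this confidence value, we can express an SCF confidence-value vector $v_i$ for a word $w_i$ in the acquired SCF lexicon ($v_{ij} = conf_{ij}$).³

In order to combine SCF confidence-value vectors for words acquired from corpora and those for words in the

¹ We used values of FREQCNT to obtain frequency counts of SCFs.

² The expectation value and variance of the beta distribution are made equal to those of the observed probability values.

³ By using the fact that $\int_0^1 p(\theta_{ij} \mid \alpha, \beta, D)\,d\theta_{ij} = 1$, we can calculate $conf_{ij}$ as follows.

$$
conf_{ij}
= \frac{\int_t^1 c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}}{\int_0^1 c \cdot \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}}
= \frac{\int_t^1 \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}}{\int_0^1 \theta_{ij}^{x + \alpha - 1}\,(1 - \theta_{ij})^{n - x + \beta - 1}\,d\theta_{ij}} \tag{6}
$$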
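The Bayesian estimate of Equations 3 to 6 can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names, the list of observed probability values, and the threshold are hypothetical. The moment estimation uses the population variance, which the text does not specify. The ratio of integrals in Equation 6 is evaluated as a regularized incomplete beta function through a standard continued-fraction expansion, which is a numerical technique chosen here and not one named in the paper.

```python
import math
from statistics import fmean, pvariance


def beta_prior_by_moments(observed):
    """Fit Beta(alpha, beta) so its mean and variance match the observed values."""
    m = fmean(observed)
    v = pvariance(observed)  # assumption: population variance
    if not 0.0 < v < m * (1.0 - m):
        raise ValueError("need 0 < variance < mean * (1 - mean)")
    k = m * (1.0 - m) / v - 1.0  # k equals alpha + beta
    return m * k, (1.0 - m) * k


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for term in (even, odd):
            d = 1.0 + term * d
            c = 1.0 + term / c
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:  # converged after the odd step
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b): integral of the Beta(a, b) density from 0 to x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def confidence(n, x, alpha, beta, t):
    """Equation 6: posterior probability that theta_ij exceeds the threshold t."""
    a = x + alpha      # posterior exponent of theta is a - 1
    b = n - x + beta   # posterior exponent of (1 - theta) is b - 1
    # The integral of Beta(a, b) from t to 1 equals I_{1-t}(b, a).
    return regularized_incomplete_beta(b, a, 1.0 - t)


# Hypothetical observed probability values of one SCF across known words.
observed = [0.01, 0.02, 0.05, 0.10, 0.30]
alpha, beta = beta_prior_by_moments(observed)
t = 0.1  # illustrative recognition threshold (an assumption)

conf_w1 = confidence(1, 1, alpha, beta, t)      # word seen once with the SCF
conf_w2 = confidence(100, 100, alpha, beta, t)  # word seen 100 times with the SCF
print(alpha, beta, conf_w1, conf_w2)
```

Unlike the relative-frequency estimate, the two words now receive different confidence values: the word observed 100 times gets a higher value than the word observed once, because the prior dominates the posterior when little data is available.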
