We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem use the hidden Markov model (HMM) framework. The HMM treats a document as mutually independent sets of words
[Figure 6 (table). Surviving labels: Random NYT, Actual ATC; P(disagree), P(missed). Surviving values: 0.263, 0.080, 0.143, 0.063.]

Figure 6: CoAP results on the ATC and NYT corpora. In the case of randomly generated transcripts, the reported results are the mean over ten sets of random transcripts taken from the same set of testing segments.
5.3 Quantitative Results
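We use the co-occurrence agreement probability (CoAP) introduced in [1] to quantitatively evaluate our segmenter. The CoAP is defined as

$$P(\text{agreement}) = \sum_{i,j} D(i,j)\,\bigl(\delta_R(i,j) \,\bar{\oplus}\, \delta_H(i,j)\bigr).$$

The function $D(i,j)$ is a probability distribution over the distances between words in a document; the functions $\delta_R(i,j)$ and $\delta_H(i,j)$ are 1 if the two words fall in the same segment of the reference and hypothesized segmentations, respectively, and 0 otherwise; and the function $\bar{\oplus}$ indicates agreement between its operands. In our case, $D(i,j)$ is nonzero only if the words are $k$ words apart. With this choice of $D$, the CoAP is a measure of how often a segmentation is correct with respect to two words that are $k$ words apart in the document. Following [1], we choose $k$ to be half the average length of a segment in the training corpus: 170 in the ATC corpus and 200 in the NYT corpus.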
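As an illustration of the fixed-distance case, the following minimal Python sketch computes the agreement probability over all pairs of words a fixed number of positions apart (called `k` here). It is not the implementation used in the experiments. The function names `labels_from_lengths` and `coap`, and the representation of a segmentation as one segment index per word, are assumptions made for this example.

```python
def labels_from_lengths(lengths):
    """Expand segment lengths into one segment index per word (hypothetical helper)."""
    labels = []
    for idx, n in enumerate(lengths):
        labels.extend([idx] * n)
    return labels


def coap(ref, hyp, k):
    """Fraction of word pairs exactly k apart on which ref and hyp agree.

    ref, hyp: per-word segment indices for the reference and hypothesized
    segmentations (assumed representation). A pair "agrees" when both
    segmentations put the two words in the same segment, or both separate them.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same words")
    n = len(ref)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of words")
    agree = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        agree += same_in_ref == same_in_hyp
    return agree / (n - k)


# Toy example: the reference has two 4-word segments, while the
# hypothesis places the boundary one word too late.
ref = labels_from_lengths([4, 4])
hyp = labels_from_lengths([5, 3])
score = coap(ref, hyp, k=2)  # 4 of the 6 pairs agree
```

Dividing by the number of pairs makes the weights over pairs sum to one, which is one way to treat the distance weighting as a probability distribution.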
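A useful interpretation of the CoAP is through its complement [1],

$$P(\text{disagreement}) = P(\text{missed})\,P(\text{seg}) + \bigl(1 - P(\text{seg})\bigr)\,P(\text{false}),$$

where $P(\text{seg})$ is the a priori probability of a segment, $P(\text{missed})$ is the probability of missing a segment, and $P(\text{false})$ is the probability of hypothesizing a segment where there is no segment.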
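The decomposition of the disagreement probability into missed and false terms can be checked on a toy example. The sketch below is illustrative only and rests on an assumption: the a priori segment probability is read as the fraction of word pairs a fixed distance `k` apart that the reference segmentation separates. The function name `disagreement_terms` and the per-word segment-index representation are hypothetical.

```python
def disagreement_terms(ref, hyp, k):
    """Estimate the a priori, missed, and false terms from pairs of words k apart.

    ref, hyp: per-word segment indices (assumed representation).
    Returns a dict holding the three terms and their combination.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same words")
    total = len(ref) - k
    if k <= 0 or total <= 0:
        raise ValueError("k must satisfy 0 < k < number of words")
    split = missed = false_alarm = 0
    for i in range(total):
        ref_split = ref[i] != ref[i + k]
        hyp_split = hyp[i] != hyp[i + k]
        if ref_split:
            split += 1
            missed += not hyp_split   # reference separates the pair, hypothesis does not
        else:
            false_alarm += hyp_split  # hypothesis separates the pair, reference does not
    p_seg = split / total
    p_missed = missed / split if split else 0.0
    p_false = false_alarm / (total - split) if total > split else 0.0
    p_disagree = p_missed * p_seg + p_false * (1 - p_seg)
    return {"seg": p_seg, "missed": p_missed, "false": p_false,
            "disagreement": p_disagree}


# Reference: two 4-word segments. Hypothesis: boundary one word too late.
ref = [0, 0, 0, 0, 1, 1, 1, 1]
hyp = [0, 0, 0, 0, 0, 1, 1, 1]
terms = disagreement_terms(ref, hyp, k=2)
```

On this example, two of the six pairs are separated in the reference. One of those two is missed, and one of the remaining four is a false alarm, so the combined disagreement is 1/3.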
Figure 6 shows the error and its decomposition for three experiments: the NYT corpus with randomly generated sequences of articles; the ATC corpus with randomly generated sequences of segments; and the ATC corpus with the true ordering of segments as they were aired. It is interesting to note that our system tends to undersegment, as indicated by the high $P(\text{missed})$. Furthermore, in the actual ATC orderings $P(\text{missed})$ is even higher due to the phenomenon of multiple segments with similar topics (see Section 5.2).
Figure 7 is a comparison between the AHMM and HMM over window widths from 2 to 200. AHMM segmentation outperforms HMM segmentation for small window widths. However, as we increase the window size, the performance of the aspect model decreases. This is due to two facts. First, the precision of the segmenter decreases, causing a slight decrease in score. More importantly, however, this behavior occurs because we are using an approximation of […]. In the approximation