We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem use the hidden Markov model (HMM) framework. The HMM treats a document as mutually independent sets of words
Note that […] is not a meaningful probability. However, the Viterbi algorithm only needs to compute […] for a single observation at a time. Thus, […] behaves like a scaling constant and we can compute […] up to this factor. Finally, since the Viterbi algorithm only compares probabilities, we can use this proportional probability without any loss.
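The scaling argument above can be checked numerically. The following is a minimal sketch, assuming a generic two-state HMM with made-up parameters rather than the model of this paper; the function `viterbi` and the per-observation constants in `scales` are hypothetical and stand in for the unknown factor. It decodes the same sequence twice, once with the original emission scores and once with every score for observation t multiplied by a constant that depends only on t, and compares the two paths.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the highest-scoring state path (log-space Viterbi).

    log_init[s]     : log P(first state = s)
    log_trans[r][s] : log P(next state = s | current state = r)
    log_emit[t][s]  : log score of observation t under state s;
                      the scores may be unnormalized
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for t in range(1, len(log_emit)):
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(pointers)
    # Trace the best final state back to the start.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Illustrative (made-up) parameters for a two-state model.
log_init = [math.log(0.6), math.log(0.4)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.2), math.log(0.8)]]
emit = [[0.7, 0.1], [0.6, 0.2], [0.1, 0.5], [0.05, 0.6]]
log_emit = [[math.log(p) for p in row] for row in emit]

# Hypothetical unknown constants, one per observation, shared by all states.
scales = [3.0, 0.25, 40.0, 1e-3]
log_emit_scaled = [[math.log(c * p) for p in row]
                   for row, c in zip(emit, scales)]

path = viterbi(log_init, log_trans, log_emit)
path_scaled = viterbi(log_init, log_trans, log_emit_scaled)
print(path, path_scaled)
```

Because each constant adds the same offset to every candidate path score at its time step, the comparisons inside the recursion are unaffected and the two decoded paths coincide.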
These formulae reflect an online approximation of one E-step in the EM algorithm. We present here an intuitive derivation to illustrate why they make sense as such an approximation. We would like to recursively estimate […] from partial estimates of […]. First, notice that […] is the empty word. This immediately gives us the base case.
We can express […] in terms of our previous information as follows.
We assume that, in a partial observation sequence, the marginal probability of selecting any word is simply […]. Observe that when […], the word is assumed to have been accounted for in […] and is absorbed in the conditioning. When […], we can compute […] by a simple application of Bayes' rule. The final equation expresses […] in terms of […]. As the approximator sees more words in a single observation, it refines its posterior distribution of the topic. It uses this refined posterior to weight the distribution of the next word.

5 Experimental results
We applied this segmentation model to two large corpora. First, we examined SPEECHBOT transcripts from All Things Considered (ATC), a daily news program on National Public Radio. Our corpus spans 317 shows from August 1998 through December 1999.