We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem utilize the hidden Markov model framework (HMM). The HMM treats a document as mutually independent sets of words
23HMMSEGMENTATION
Figure1:AgraphicalmodelrepresentingthesegmentingHMM
3HMMSegmentation
InthesegmentingHMMframework,anunsegmenteddocumentistreatedasacollec-tionofmutuallyindependentsetsofwords.Themodelpositsthateachsetisprob-abilisticallygeneratedbyahiddentopicvariableinaseries.Transitionprobabilitiesbetweentopicsdeterminethenexthiddenvariableinthesequence.
Asagenerativemodel,theHMMpositsthatadocumentisproducedbythefol-lowingprocess:chooseatopicfromaninitialdistributionoftopics;generateasetofindependentwordsfromadistributionoverwordsassociatedwiththattopic;chooseanothertopic,possiblythesametopicfromadistributionofallowedtransitions;repeatthisprocess.Givenanew,unsegmenteddocument,oneinvertsthisprocessbycalculat-ingthemostlikelysetoftopicswhichgeneratedthe-wordsetsofthegivendocument.Topicbreaksoccuratthepointswherethevalueofthetopicvariableschange.
Moreformally,aresetsofwordsandaregen-eratedbyatopic.Eachdependsonlyonandtheareindependentofeachothergiven.Thisisillustratedinthegraphicalmodelin gure1.Circlesrepre-sentrandomvariablesandarrowsindicatepossiblydependency.Theboxaroundindicatesthatthisrandomvariableisrepeatedtimesforeachtopicvariableintheseries.
TheHMMisparameterizedbyatransitionprobabilitydistributionbetweentopicsandasetoftopic-basedunigramlanguagemodelsforeachpossiblevalueof.Totrainthemodel,asetofsegmentsfromacorpusisclusteredusingthe-meansalgorithm.Aunigramlanguagemodeliscomputedforeachoftheseclustersandanappropriatesmoothingtechniqueisappliedtoaccountforsparsity.Thetransitionprob-
isaparameterwhichisseparatelyabilitydistributionbetweentopicstates
tunedin[6].Wesimplyusenormalizedcountsoftransitionsbetweenclustersinthetrainingsettoestimateit.Notethatthismodelrequiresasegmentedcorpustotrain,butworksinanunsupervisedmannertoclusterthosesegments.
Tosegmentanewdocument,thestreamoftextisdividedintoasequenceof
ofwordseach.TheViterbialgorithm[7],adynamicprogram-observations
mingtechnique,isusedto ndthemostlikelyhiddensequenceoftopicstates
givenanobservedsequenceofwordsets.Topicbreaksoccurwhen.