We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem use the hidden Markov model (HMM) framework. The HMM treats a document as mutually independent sets of words
4.2 The aspect HMM
Figure 2: A graphical model representing a segmenting AHMM
assignment. However, in practice, for a fixed segment, the distribution over factors is peaked towards one value. In this case, we feel justified in assigning each segment to the factor with maximal probability.
The AHMM segments a new document by dividing its words into observation windows of a fixed size and running the Viterbi algorithm to find the most likely sequence of hidden topics which generated the given document. Segmentation breaks occur when the value of the topic variable changes from one window to the next. The Viterbi algorithm requires the observation probability for each time step. While the HMM uses the naive Bayes assumption to compute this distribution, we treat each observation window as a new segment label and compute its probability via the aspect model.

One problem with the aspect model is that it is not a truly generative model with respect to document labels. The document parameter is a discrete distribution over the set of training documents. Therefore, the model can only compute conditional probabilities about those segments which it was exposed to in training. In the Viterbi algorithm, we need to find the corresponding probability for some observation window. This observation is not a document label that the model has seen before. To find this probability properly, one should retrain the model using EM on the training corpus as well as the observation window and the words it contains. However, this is very inefficient. In practice, one can use an online approximation to EM instead. We use a variant as described in [3]. The approximation is defined over partial observations, ranging from the observation with no words to the full observation, and is computed recursively as follows.
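The segmentation procedure described above (windowing the words, Viterbi decoding over topics, and placing breaks where the decoded topic changes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the NumPy dependency, and the log-probability inputs `log_init`, `log_trans` and `log_obs` are assumptions made for illustration. In particular, `log_obs` is taken as given, so the sketch does not commit to how the per-window observation probabilities are obtained from the aspect model.

```python
import numpy as np


def split_into_windows(words, window_size):
    """Divide a word sequence into consecutive observation windows.

    The last window may be shorter than window_size.
    """
    return [words[i:i + window_size] for i in range(0, len(words), window_size)]


def viterbi(log_init, log_trans, log_obs):
    """Return the most likely topic sequence for a series of windows.

    log_init[k]     : log-probability of starting in topic k (assumed input)
    log_trans[i, j] : log-probability of moving from topic i to topic j
    log_obs[t, k]   : log-probability of window t under topic k
    """
    n_steps, n_topics = log_obs.shape
    score = np.empty((n_steps, n_topics))
    backptr = np.zeros((n_steps, n_topics), dtype=int)

    score[0] = log_init + log_obs[0]
    for t in range(1, n_steps):
        # cand[i, j]: best score ending in topic i at t-1, then moving to j.
        cand = score[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(cand, axis=0)
        score[t] = cand[backptr[t], np.arange(n_topics)] + log_obs[t]

    # Trace the best path backwards from the best final topic.
    path = np.empty(n_steps, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path


def segment_breaks(path):
    """Indices of windows whose topic differs from the previous window."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

Working in log space avoids numerical underflow when many windows are multiplied together. This is a common choice for Viterbi decoding and is not something the text above prescribes.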