We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem utilize the hidden Markov model framework (HMM). The HMM treats a document as mutually independent sets of words
85EXPERIMENTALRESULTS
IABACDECFAGFHJ
1234567891011121314151617
Figure4:AsegmentationofAllThingsConsideredfromApril29,1999.Thetopdiagramisthehypothesissegmentation.Thebottomdiagramisthetruesegmentation.Figure4showsasegmentationfromarealtranscriptofATConApril29,1999.Thesegmentationisnotperfectbuthypothesizesthedetectedtopicbreaksatapproximatelythecorrectpointsintheprogram.At rst,thereseemtobemanymissedbreaks.Wearguehoweverthatthesemissedstorybreaksdonotalwaysconstitutetopicbreaksandthereforearenotindicativeoftheperformanceofourmodel.Toillustratethis,weexploreamethodoftopiclabelingbasedonthelanguagemodelparametersoftheaspectmodel.
Onewayofidentifyingthetopicswhichthesegmenter ndsisbythetop fteen
parameterforthevalueofwhichtheViterbialgorithmassignedwordsofthe
toaparticularsegment.Figure5liststhesewordsets(denotedbyaletter)astheycorrespondtothetopicsinthesegmentation(denotedbyanumber).Forexample,story14isabouttheIsraeli/Palestiniancon ict.Itscorrespondingsegmentinthehy-pothesissegmentationcanbedescribedbythewordsintopicFwhichincludepeace,israeli,andpalestinian.
Analysisofthiscorrespondenceoftenexplainsmissedtopicbreaks.Articles11and12arebothabouttheKosovarrefugees.Understandably,theyarebothassignedtotopicAandthebreakbetweenstoriesgoesundetected.
Notethatthesegmentercanworkevenifthetopwordsoffailtogiveagoodtopicdescription.ThestoryaboutdeformedfrogsisassignedtopicI,arathergenericlanguagemodelwithnorealdescriptivewords.However,thesubsequentstoryabouttheeconomy tstopicJsowellthattheAHMMisabletoproperlydetectthebreak.