We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem utilize the hidden Markov model framework (HMM). The HMM treats a document as mutually independent sets of words
12REFERENCESexploreatemporalanalysisofourdataandmodellongtermtopicshiftsinthehiddenfactorsandlanguagemodels.
References
[1]DougBeeferman,AdamBerger,andJohnLafferty.Statisticalmodelsfortext
segmentation.MachineLearning,1999.
[2]A.P.Dempster,ird,andD.B.Rubin.Maximumlikelihoodfromincom-
pletedataviatheemalgorithm.JournaloftheRoyalStatisticalSociety,SeriesB(Methodological),39(1):1–38,1977.
[3]DanielGildeaandThomasHofmann.Topic-basedlanguagemodelsusingem.
EuroSpeech-99,pages2167–2170,1999.
[4]MartiA.Hearst.Contextandstructureinautomatedfull-textinformationaccess.
puterScienceDivisionTech-nicalReport,1994.
[5]ThomasHofmann.Probabilisticlatentsemanticindexing.Proceedingsofthe
Twenty-SecondAnnualInternationalSIGIRConferenceonResearchandDevel-opmentinInformationRetrieval,1999.
[6]P.vanMulbregt,I.Carp,L.Gillick,S.Lowe,andJ.Yamron.Textsegmentation
andtopictrackingonbroadcastnewsviaahiddenmarkovmodelapproach.1998.
[7]AndrewJ.Viterbi.Errorboundsforconvolutionalcodesandanasymptotically
optimaldecodingalgorithm.IEEETransactionsonInformationTheory,13:260–269,1967.