We present a novel probabilistic method for partially unsupervised topic segmentation of unstructured text. Previous approaches to this problem use the hidden Markov model (HMM) framework. The HMM treats a document as mutually independent sets of words
Authors' email: blei@cs.berkeley.edu, Pedro.Moreno@
© Compaq Computer Corporation, 2001
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of the Cambridge Research Laboratory of Compaq Computer Corporation in Cambridge, Massachusetts; an acknowledgment of the authors and individual contributors to the work; and all applicable portions of the copyright notice. Copying, reproducing, or republishing for any other purpose shall require a license with payment of fee to the Cambridge Research Laboratory. All rights reserved.
CRL Technical reports are available on the CRL’s web page.
Compaq Computer Corporation
Cambridge Research Laboratory
One Cambridge Center
Cambridge, Massachusetts 02142 USA