Topic Segmentation with an Aspect Hidden Markov Model
David M. Blei
University of California, Berkeley
Dept. of Computer Science
Berkeley, CA 94720
Pedro J. Moreno
Cambridge Research Laboratory
Compaq Computer Corporation
Cambridge, MA 02142-1612
July 2001
Abstract
We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem utilize the hidden Markov model framework (HMM). The HMM treats a document as mutually independent sets of words generated by a latent topic variable in a time series. We extend this idea by embedding the aspect model for text into the segmenting HMM. In doing so, we provide an intuitive topical dependency between words and a cohesive segmentation model. We apply this method to segment unbroken streams of New York Times articles as well as noisy transcripts of radio programs on SPEECHBOT, an online audio archive indexed by an automatic speech recognition engine. We provide experimental comparisons between our technique and the HMM approach. Our results suggest that this technique can perform as well as the HMM method and in some cases even better.