We present a novel probabilistic method for partially unsupervised topic segmentation of unstructured text. Previous approaches to this problem use the hidden Markov model (HMM) framework. The HMM treats a document as mutually independent sets of words
Figure 3: Tempered EM convergence in the ATC and NYT corpora
Within these shows there are 4,917 segments with a vocabulary of 35,777 unique terms. The shows constitute about 4 million words. We estimated the word error rate in this corpus to be in the 30% to 40% range. Note that these are only estimates computed by sampling the corpus, as perfect transcripts are unavailable to us.
Additionally, we analyzed a corpus of 3,830 articles from the New York Times (NYT) to compare the ASR performance with error-free text. This corpus constitutes about 4 million words with a vocabulary of 70,792 unique terms. In all reported experiments, we learn an aspect model with 20 hidden factors.
5.1 Aspect model EM training
Figure 3 illustrates the performance on held-out data during the tempered EM training of the aspect model (see section 4.1). Though the NYT corpus takes longer to converge (due to its larger vocabulary), it learns more quickly than the ATC corpus since its text contains no errors. The ATC corpus converges faster (due to its smaller vocabulary) but stays at a low CoAP (see section 5.3) for several iterations before performance improves.
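The tempered EM procedure referenced above can be sketched as follows. This is a toy illustration of tempered EM for a PLSA-style aspect model, not the paper's implementation: the function name, the fixed annealing schedule (`beta *= eta` every iteration), and all hyperparameter values are assumptions made here for concreteness. In practice the inverse temperature is lowered only when held-out likelihood degrades.

```python
import numpy as np

def tempered_em_plsa(counts, n_topics=20, beta=1.0, eta=0.95,
                     n_iters=50, seed=0):
    """Tempered EM for a PLSA-style aspect model (illustrative sketch).

    counts : (n_docs, n_words) term-count matrix.
    beta   : inverse temperature; lowering it smooths the E-step
             posteriors, which damps overfitting.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialisation of P(z), P(w|z), P(d|z).
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: tempered posterior P(z|d,w) ∝ [P(z) P(d|z) P(w|z)]^beta.
        joint = (p_z[:, None, None]
                 * p_d_z[:, :, None]
                 * p_w_z[:, None, :]) ** beta
        joint /= joint.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts.
        exp_counts = joint * counts[None, :, :]      # shape (z, d, w)
        nz = exp_counts.sum(axis=(1, 2))
        p_w_z = exp_counts.sum(axis=1) / nz[:, None]
        p_d_z = exp_counts.sum(axis=2) / nz[:, None]
        p_z = nz / nz.sum()
        # Anneal; a real run would do this only when held-out
        # likelihood stops improving.
        beta *= eta
    return p_z, p_w_z, p_d_z
```

Lowering `beta` flattens the topic posterior in the E-step, which is why training monitored on held-out data (as in Figure 3) can keep improving after the untempered likelihood would have overfit.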
5.2 Sample results and topic labels
In our experiments, we use three variants of our two corpora. First, we create random sequences of segments from the ATC corpus. Second, we create random sequences from the NYT corpus to compare clean versus noisy segmentation. Finally, we use the actual aired sequences of ATC segments, since this is the domain of the primary problem which we are trying to tackle.
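Building the random-sequence variants amounts to sampling segments and concatenating them while recording the true boundaries for evaluation. A minimal sketch, assuming segments are token lists; the function and variable names are illustrative, not from the paper:

```python
import random

def make_random_sequence(segments, n_segments, seed=0):
    """Concatenate a random sample of segments into one evaluation
    stream, recording the true topic-boundary token positions.

    segments : list of token lists (one list per segment).
    Returns (tokens, boundaries); no boundary is recorded after
    the final segment.
    """
    rng = random.Random(seed)
    chosen = rng.sample(segments, n_segments)
    tokens, boundaries, pos = [], [], 0
    for seg in chosen:
        tokens.extend(seg)
        pos += len(seg)
        boundaries.append(pos)
    return tokens, boundaries[:-1]
```

The recorded boundary positions serve as the gold standard against which a segmenter's hypothesized boundaries are scored.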
In the random sequences of segments, we attain almost perfect segmentation on both corpora. However, the results are mixed with the original broadcasts of the ATC.