ContinuousTimeDynamicTopicModels
ChongWangComputerScienceDept.PrincetonUniversityPrinceton,NJ08540DavidBlei
ComputerScienceDept.PrincetonUniversityPrinceton,NJ08540DavidHeckermanMicrosoftResearchOneMicrosoftWayRedmond,WA98052
Abstract
Inthispaper,wedevelopthecontinuoustimedynamictopicmodel(cDTM).ThecDTMisadynamictopicmodelthatusesBrownianmotiontomodelthelatenttopicsthroughasequentialcollectionofdocuments,wherea“topic”isapatternofwordusethatweexpecttoevolveoverthecourseofthecol-lection.Wederiveane cientvariationalapproximateinferencealgorithmthattakesadvantageofthesparsityofobservationsintext,apropertythatletsuseasilyhan-dlemanytimepoints.IncontrasttothecDTM,theoriginaldiscrete-timedynamictopicmodel(dDTM)requiresthattimebediscretized.Moreover,thecomplexityofvari-ationalinferenceforthedDTMgrowsquicklyastimegranularityincreases,adrawbackwhichlimits ne-graineddiscretization.WedemonstratethecDTMontwonewscorpora,reportingbothpredictiveperplexityandthenoveltaskoftimestampprediction.
email[15],computervision[7],bioinformatics[18],andinformationretrieval[24].Foragoodreview,see[8].Mosttopicmodelsassumethedocumentsareex-changeableinthecollection,i.e.,thattheirprobabilityisinvarianttopermutation.Manydocumentcollec-tions,suchasnewsorscienti cjournals,evolveovertime.Inthispaper,wedevelopthecontinuoustimedynamictopicmodel(cDTM),whichisanextensionofthediscretedynamictopicmodel(dDTM)[2].Givenasequenceofdocuments,weinferthelatenttopicsandhowtheychangethroughthecourseofthecollection.ThedDTMusesastatespacemodelonthenaturalpa-rametersofthemultinomialdistributionsthatrepre-sentthetopics.Thisrequiresthattimebediscretizedintoseveralperiods,andwithineachperiodLDAisusedtomodelitsdocuments.In[2],theauthorsan-alyzethejournalSciencefrom1880-2002,assumingthatarticlesareexchangeablewithineachyear.WhilethedDTMisapowerfulmodel,thechoiceofdiscretiza-tiona ectsthememoryrequirementsandcomputa-tionalcomplexityofposteriorinference.Thislargelydeterminestheresolutionatwhichto tthemodel.Toresolvetheproblemofdiscretization,weconsidertimetobecontinuous.Thecontinuoustimedynamictopicmodel(cDTM)proposedherereplacesthedis-cretestatespacemodelofthedDTMwithitscontinu-ousgeneralization,Brownianmotion[14].ThecDTMgeneralizesthedDTMinthattheonlydiscretizationitmodelsistheresolutionatwhichthetimestampsofthedocumentsaremeasured.
ThecDTMmodelwill,generally,introducemanymorelatentvariablesthanthedDTM.However,thisseem-inglymorecomplicatedmodelissimplerandmoree -cientto t.Aswewillseebelow,fromthisformulationthevariationalposteriorinferenceprocedurecantakeadvantageofthenaturalsparsityoftext,thefactthatnotallvocabularywordsareusedateachmeasuredtimestep.Infact,astheresolutiongets ner,fewerandfewerwordsareused.
1Introduction
Toolsforanalyzingandmanaginglargecollectionsofelectronicdocumentsarebecomingincreasinglyim-portant.Inrecentyears,topicmodels,whicharehi-erarchicalBayesianmodelsofdiscretedata,havebe-comeawidelyusedapproachforexploratoryandpre-dictiveanalysisoftext.Topicmodels,suchaslatentDirichletallocation(LDA)andthemoregeneraldis-cretecomponentanalysis[3,4],positthatasmallnumberofdistributionsoverwords,calledtopics,canbeusedtoexplaintheobservedcollection.LDAisaprobabilisticextensionoflatentsemanticindexing(LSI)[5]andprobabilisticlatentsemanticindexing(pLSI)[11].Owingtoitsformalgenerativesemantics,LDAhasbeenextendedandappliedtoauthorship[19],
Thisprovidesaninferentialspeed-upthatmakesitpossibleto tmodelsatvaryinggranularities.Asex-amples,journalarticlesmightbeexchangeablewithinanissue,anassumptionwhichismorerealisticthanonewheretheyareexchangeablebyyear.Otherdata,suchasnews,mightexperienceperiodsoftimewithoutanyobservation.WhilethedDTMrequiresrepresent-ingalltopicsforthediscretetickswithintheseperiods,thecDTMcananalyzesuchdatawithoutasacri ceofmemoryorspeed.WiththecDTM,thegranularitycanbechosentomaximizemodel tnessratherthantolimitcomputationalcomplexity.
WenotethatthecDTManddDTMarenottheonlytopicmodelstotaketimeintoconsideration.Topicsovertimemodels(TOT)[23]anddynamicmixturemodels(DMM)[25]alsoincludetimestampsintheanalysisofdocuments.TheTOTmodeltreatsthetimestampsasobservationsofthelatenttopics,whileDMMassumesthatthetopicmixtureproportionsofeachdocumentisdependentonprevioustopicmix-tureproportions.InbothTOTandDMM,thetopicsthemselvesareconstant,andthetimeinformationisusedtobetterdiscoverthem.Inthesettinghere,weareinterestedininferringevolvingtopics.
Therestofthepaperisorganizedasfollows.Insec-tion2wedescribethedDTManddevelopthecDTMindetail.Section3presentsane cientposteriorin-ferencealgorithmforthecDTMbasedonsparsevaria-tionalmethods.Insection4,wepresentexperimentalresultsontwonewscorpora.
2
Continuoustimedynamictopicmodels
Inatimestampeddocumentcollection,wewouldliketomodelitslatenttopicsaschangingthroughthecourseofthecollection.Innewsdata,forexample,asingletopicwillchangeasthestoriesassociatedwithitdevelop.Thediscrete-timedynamictopicmodel(dDTM)buildsontheexchangeabletopicmodeltoprovidesuchmachinery[2].InthedDTM,documentsaredividedintosequentialgroups,andthetopicsofeachsliceevolvefromthetopicsofthepreviousslice.Documentsinagroupareassumedexchangeable.Morespeci cally,atopicisrepresentedasadistribu-tionoverthe xedvocabularyofthecollection.ThedDTMassumesthatadiscrete-timestatespacemodelgovernstheevolutionofthenaturalparametersofthemultinomialdistributionsthatrepresentthetopics.(Recallthatthenaturalparametersofthemultino-mialarethelogsoftheprobabilitiesofeachitem.)Thisisatime-seriesextensiontothelogisticnormaldistribution
[26].
Figure1:GraphicalmodelrepresentationofthecDTM.TheevolutionofthetopicparametersβtisgovernedbyBrownianmotion.Thevariablestistheobservedtimestampofdocumentdt.
AdrawbackofthedDTMisthattimeisdiscretized.Iftheresolutionischosentobetoocoarse,thentheassumptionthatdocumentswithinatimestepareex-changeablewillnotbetrue.Iftheresolutionistoo ne,thenthenumberofvariationalparameterswillex-plodeasmoretimepointsareadded.Choosingthed …… 此处隐藏:21671字,全部文档内容请下载后查看。喜欢就下载吧 ……