
Continuous Dynamic Topic Models: Wang, Blei, Heckerman 2008


Continuous Time Dynamic Topic Models

Chong Wang, Computer Science Dept., Princeton University, Princeton, NJ 08540

David Blei, Computer Science Dept., Princeton University, Princeton, NJ 08540

David Heckerman, Microsoft Research, One Microsoft Way, Redmond, WA 98052

Abstract

In this paper, we develop the continuous time dynamic topic model (cDTM). The cDTM is a dynamic topic model that uses Brownian motion to model the latent topics through a sequential collection of documents, where a “topic” is a pattern of word use that we expect to evolve over the course of the collection. We derive an efficient variational approximate inference algorithm that takes advantage of the sparsity of observations in text, a property that lets us easily handle many time points. In contrast to the cDTM, the original discrete-time dynamic topic model (dDTM) requires that time be discretized. Moreover, the complexity of variational inference for the dDTM grows quickly as time granularity increases, a drawback which limits fine-grained discretization. We demonstrate the cDTM on two news corpora, reporting both predictive perplexity and the novel task of time stamp prediction.

1 Introduction

Tools for analyzing and managing large collections of electronic documents are becoming increasingly important. In recent years, topic models, which are hierarchical Bayesian models of discrete data, have become a widely used approach for exploratory and predictive analysis of text. Topic models, such as latent Dirichlet allocation (LDA) and the more general discrete component analysis [3, 4], posit that a small number of distributions over words, called topics, can be used to explain the observed collection. LDA is a probabilistic extension of latent semantic indexing (LSI) [5] and probabilistic latent semantic indexing (pLSI) [11]. Owing to its formal generative semantics, LDA has been extended and applied to authorship [19], email [15], computer vision [7], bioinformatics [18], and information retrieval [24]. For a good review, see [8].

Most topic models assume the documents are exchangeable in the collection, i.e., that their probability is invariant to permutation. Many document collections, such as news or scientific journals, evolve over time. In this paper, we develop the continuous time dynamic topic model (cDTM), which is an extension of the discrete dynamic topic model (dDTM) [2]. Given a sequence of documents, we infer the latent topics and how they change through the course of the collection.

The dDTM uses a state space model on the natural parameters of the multinomial distributions that represent the topics. This requires that time be discretized into several periods, and within each period LDA is used to model its documents. In [2], the authors analyze the journal Science from 1880-2002, assuming that articles are exchangeable within each year. While the dDTM is a powerful model, the choice of discretization affects the memory requirements and computational complexity of posterior inference. This largely determines the resolution at which to fit the model.

To resolve the problem of discretization, we consider time to be continuous. The continuous time dynamic topic model (cDTM) proposed here replaces the discrete state space model of the dDTM with its continuous generalization, Brownian motion [14]. The cDTM generalizes the dDTM in that the only discretization it models is the resolution at which the time stamps of the documents are measured.

The cDTM model will, generally, introduce many more latent variables than the dDTM. However, this seemingly more complicated model is simpler and more efficient to fit. As we will see below, from this formulation the variational posterior inference procedure can take advantage of the natural sparsity of text, the fact that not all vocabulary words are used at each measured time step. In fact, as the resolution gets finer, fewer and fewer words are used. This provides an inferential speed-up that makes it possible to fit models at varying granularities. As examples, journal articles might be exchangeable within an issue, an assumption which is more realistic than one where they are exchangeable by year. Other data, such as news, might experience periods of time without any observation. While the dDTM requires representing all topics for the discrete ticks within these periods, the cDTM can analyze such data without a sacrifice of memory or speed. With the cDTM, the granularity can be chosen to maximize model fitness rather than to limit computational complexity.
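To illustrate why gaps between observations need no intermediate ticks under Brownian motion, here is a minimal sketch in Python. It relies only on the standard property that Brownian increments are Gaussian with variance proportional to the elapsed time. The function name `sample_brownian_path`, the parameter `variance_rate`, and the toy time stamps are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random

def sample_brownian_path(timestamps, beta0, variance_rate, rng):
    """Sample one topic's natural parameters at sorted, irregular time stamps.

    Each step adds Gaussian noise with variance variance_rate * elapsed time,
    so a long gap costs one draw rather than one draw per intermediate tick.
    (variance_rate is a hypothetical name for the Brownian variance parameter.)
    """
    path = [list(beta0)]
    for prev_t, next_t in zip(timestamps, timestamps[1:]):
        elapsed = next_t - prev_t
        scale = math.sqrt(variance_rate * elapsed)  # standard deviation of the increment
        path.append([b + rng.gauss(0.0, scale) for b in path[-1]])
    return path

rng = random.Random(0)
timestamps = [0.0, 0.5, 0.75, 30.0]  # toy time stamps with a long unobserved gap
path = sample_brownian_path(timestamps, [0.0] * 5, variance_rate=0.01, rng=rng)
print(len(path), len(path[0]))  # 4 5
```

Only the four observed time stamps are represented, however long the gap before the last one; a discretized model with unit ticks would have to carry parameters for every tick in between.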

We note that the cDTM and dDTM are not the only topic models to take time into consideration. Topics over time models (TOT) [23] and dynamic mixture models (DMM) [25] also include time stamps in the analysis of documents. The TOT model treats the time stamps as observations of the latent topics, while DMM assumes that the topic mixture proportions of each document are dependent on previous topic mixture proportions. In both TOT and DMM, the topics themselves are constant, and the time information is used to better discover them. In the setting here, we are interested in inferring evolving topics.

The rest of the paper is organized as follows. In section 2 we describe the dDTM and develop the cDTM in detail. Section 3 presents an efficient posterior inference algorithm for the cDTM based on sparse variational methods. In section 4, we present experimental results on two news corpora.

2 Continuous time dynamic topic models

In a time stamped document collection, we would like to model its latent topics as changing through the course of the collection. In news data, for example, a single topic will change as the stories associated with it develop. The discrete-time dynamic topic model (dDTM) builds on the exchangeable topic model to provide such machinery [2]. In the dDTM, documents are divided into sequential groups, and the topics of each slice evolve from the topics of the previous slice. Documents in a group are assumed exchangeable.

More specifically, a topic is represented as a distribution over the fixed vocabulary of the collection. The dDTM assumes that a discrete-time state space model governs the evolution of the natural parameters of the multinomial distributions that represent the topics. (Recall that the natural parameters of the multinomial are the logs of the probabilities of each item.) This is a time-series extension to the logistic normal distribution [26].
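The discrete-time evolution described above can be sketched as follows, assuming the state space model is a Gaussian random walk on the natural parameters and that each slice's topic is recovered by normalizing their exponentials (a softmax). The helper names `softmax` and `evolve_topic`, the step size `sigma`, and the toy vocabulary are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(beta):
    """Map natural parameters (unnormalized log probabilities) to a distribution."""
    m = max(beta)  # subtract the maximum for numerical stability
    exps = [math.exp(b - m) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]

def evolve_topic(beta0, num_slices, sigma, rng):
    """Gaussian random walk: each slice's parameters start from the previous slice's."""
    slices = [list(beta0)]
    for _ in range(num_slices - 1):
        slices.append([b + rng.gauss(0.0, sigma) for b in slices[-1]])
    return slices

rng = random.Random(0)
vocab = ["gene", "dna", "cell", "protein"]  # toy vocabulary (assumption)
beta0 = [math.log(p) for p in (0.4, 0.3, 0.2, 0.1)]
for t, beta in enumerate(evolve_topic(beta0, num_slices=3, sigma=0.3, rng=rng)):
    probs = softmax(beta)
    print(t, {w: round(p, 3) for w, p in zip(vocab, probs)})
```

Because the walk lives in the unconstrained parameter space, every slice still yields a valid distribution over the fixed vocabulary after normalization.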

Figure 1: Graphical model representation of the cDTM. The evolution of the topic parameters β_t is governed by Brownian motion. The variable s_t is the observed time stamp of document d_t.

A drawback of the dDTM is that time is discretized. If the resolution is chosen to be too coarse, then the assumption that documents within a time step are exchangeable will not be true. If the resolution is too fine, then the number of variational parameters will explode as more time points are added. Choosing the d ……
