The Algorithm that Won the Netflix Recommender System Prize
Frequencies help in distinguishing days on which users rate a lot in bulk. Typically, such ratings are not given close to the actual watching day. Our theory is that when rating in bulk, users still reflect their normal preferences. However, certain movies attract an asymmetric attitude. Some people like them, and will remember them for long as their all-time favorites. On the other hand, some people dislike them and just tend to forget them. Thus, when giving bulk ratings, only those with the positive attitude will mark them as their favorites, while those disliking them will not mention them. Such behavior is expected towards the most popular movies, which can either be remembered as very good or just be forgotten. A similar phenomenon can also happen with a negative slant. Some movies are notoriously bad, and people who did not like them always give them as negative examples, indicating what they do not want to watch. However, for the other part of the population, who liked those movies, they are not going to be remembered for long as salient positive examples. Thus, when rating in bulk, long after watching the movie, only those who disliked the movie will rate it.
This explains why such biases should be associated with movies, not with users. This also explains why most of the effect disappears when adding the interaction terms, which already "understand" that the user is of the type that likes/dislikes the movie. In other words, we hypothesize that high frequencies (or bulk ratings) do not represent much change in people's taste, but mostly a biased selection of movies to be rated - some movies are natural candidates as "bad examples", while others are natural "good examples". We believe that further validating our hypothesis bears practical implications. If, indeed, frequencies represent biased selection, they should be treated as capturing noise, which needs to get isolated out when making recommendations.
Finally, we should comment that a movie renter such as Netflix might have additional data sources that complement frequencies. For example, data on the time passed since the actual watching date, or on whether ratings were entered in response to a given questionnaire or initiated by the user.
C. Predicting future days
Our models include day-specific parameters. We are often asked how these models can be used for predicting ratings in the future, on new dates for which we cannot train the day-specific parameters. The simple answer is that for those future (untrained) dates, the day-specific parameters should take their default value. In particular, for (11), c_u(t_ui) is set to c_u, and b_{u,t_ui} is set to zero. Yet, one wonders, if we cannot use the day-specific parameters for predicting the future, why are they good at all? After all, prediction is interesting only when it is about the future. To further sharpen the question, we should mention the fact that the Netflix Qualifying set includes many ratings on dates for which we have no other rating by the same user, and hence day-specific parameters cannot be exploited. To answer this, notice that our temporal modeling makes no attempt to capture future changes. All it is trying to do is to capture transient temporal effects, which had a significant influence on past user feedback. When such effects are identified they must be tuned down, so that we can model the more enduring signal. This allows our model to better capture the long-term characteristics of the data, while letting dedicated parameters absorb short-term fluctuations. For example, if a user gave many higher-than-usual ratings on a particular single day, our models discount those by accounting for a possible day-specific good mood, which does not reflect the longer-term behavior of this user. This way, the day-specific parameters accomplish a kind of data cleaning, which improves prediction of future dates.

D. What's in the blend?
The RMSE=0.9555 result of model (10) is included in the blend. To learn the involved parameters, b_u, α_u, b_{u,t}, b_i, b_{i,Bin(t)}, c_u, and c_{u,t}, one should minimize the regularized squared error on the training set. Learning is done by a stochastic gradient descent algorithm running for 30 iterations. We use a separate learning rate (step size) and regularization (weight decay) for each kind of learned parameter, by minimizing the cost function
\min_{b_*,\alpha_*,c_*} \sum_{(u,i)\in K} \left( r_{ui} - \mu - b_u - \alpha_u \cdot \mathrm{dev}_u(t_{ui}) - b_{u,t_{ui}} - (b_i + b_{i,\mathrm{Bin}(t_{ui})}) \cdot (c_u + c_{u,t_{ui}}) \right)^2
  + \lambda_a b_u^2 + \lambda_b \alpha_u^2 + \lambda_c b_{u,t_{ui}}^2 + \lambda_d b_i^2 + \lambda_e b_{i,\mathrm{Bin}(t_{ui})}^2 + \lambda_f (c_u - 1)^2 + \lambda_g c_{u,t_{ui}}^2    (12)
Actual values of the learning rates and regularization constants (λ_a, λ_b, ..., λ_g) are as follows:

            b_u    b_{u,t}   α_u     b_i    b_{i,Bin(t)}   c_u    c_{u,t}
reg × 10²   3      5e-1      5000    3      10             1      5e-1
Notice that regularization shrinks parameters towards zero, with one exception. The multiplier c_u is shrunk towards 1, i.e., we penalize (c_u - 1)², rather than c_u². Similarly, all learned parameters are initialized to zero, except c_u, which is initialized to 1.
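A minimal sketch of this training scheme, with illustrative data structures: the predictor follows the time-aware baseline above, all parameters start at zero except the multiplier c_u, which starts at 1 and is shrunk towards 1, and day-specific parameters simply keep their defaults (b_{u,t} = 0, c_{u,t} = 0) on untrained dates, as described in Section C. For brevity a single learning rate and regularization constant stand in for the per-parameter constants λ_a, ..., λ_g, and the exponent β of dev_u(t) and the number of item time bins are assumed values, not the paper's tuned ones.

```python
import math
from collections import defaultdict

BETA = 0.4     # exponent of dev_u(t); an assumed value
N_BINS = 30    # number of item time bins; an assumed value

class TimeAwareBaseline:
    def __init__(self, mu, mean_date, n_days):
        self.mu = mu                    # global rating average
        self.mean_date = mean_date      # t_u: mean rating date per user
        self.bin_size = max(1, n_days // N_BINS)
        self.bu = defaultdict(float)        # b_u
        self.alpha = defaultdict(float)     # alpha_u
        self.but = defaultdict(float)       # b_{u,t}, keyed by (u, t)
        self.bi = defaultdict(float)        # b_i
        self.bib = defaultdict(float)       # b_{i,Bin(t)}, keyed by (i, bin)
        self.cu = defaultdict(lambda: 1.0)  # c_u, initialized to 1
        self.cut = defaultdict(float)       # c_{u,t}, keyed by (u, t)

    def dev(self, u, t):
        # dev_u(t) = sign(t - t_u) * |t - t_u|^beta
        d = t - self.mean_date[u]
        return math.copysign(abs(d) ** BETA, d)

    def predict(self, u, i, t):
        # On future (untrained) dates the day-specific parameters keep
        # their defaults: b_{u,t} = 0 and c_{u,t} = 0, so c_u(t) = c_u.
        return (self.mu + self.bu[u] + self.alpha[u] * self.dev(u, t)
                + self.but[(u, t)]
                + (self.bi[i] + self.bib[(i, t // self.bin_size)])
                * (self.cu[u] + self.cut[(u, t)]))

    def sgd_step(self, u, i, t, r, lr=0.005, reg=0.03):
        e = r - self.predict(u, i, t)
        b = t // self.bin_size
        bsum = self.bi[i] + self.bib[(i, b)]    # b_i + b_{i,Bin(t)}
        csum = self.cu[u] + self.cut[(u, t)]    # c_u + c_{u,t}
        self.bu[u] += lr * (e - reg * self.bu[u])
        self.alpha[u] += lr * (e * self.dev(u, t) - reg * self.alpha[u])
        self.but[(u, t)] += lr * (e - reg * self.but[(u, t)])
        self.bi[i] += lr * (e * csum - reg * self.bi[i])
        self.bib[(i, b)] += lr * (e * csum - reg * self.bib[(i, b)])
        # c_u is shrunk towards 1, not towards 0:
        self.cu[u] += lr * (e * bsum - reg * (self.cu[u] - 1.0))
        self.cut[(u, t)] += lr * (e * bsum - reg * self.cut[(u, t)])
```

Sweeping sgd_step over the training set for 30 iterations mirrors the procedure above; predictions on Qualifying-set dates with no trained day-specific parameters then automatically fall back to the defaults.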
The blend also includes the result of the more accurate baseline predictor (11). In fact, this is the only case where we resorted to an automatic parameter tuner (APT) to find the best constants (learning rates, regularization, and log basis). Specifically, we were using APT1, which is described in [13]. The reason we used APT here is twofold. First, this baseline predictor component is embedded in our more comprehensive models (described later). Therefore, it is worthwhile to highly optimize it. Second, this is a small, quickly-trained model, so we could easily afford many hundreds of automatic executions seeking optimal settings. Still, it is worth mentioning that the benefit of APT was an RMSE reduction of (only) 0.0016 over our initial manual settings.
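APT1 itself is described in [13]; as a generic illustration of this kind of automatic tuning (a random-search sketch, not the actual APT1 algorithm), the loop below tunes a learning rate and regularization constant for a toy one-parameter ridge model, keeping whichever setting gives the best held-out RMSE. The same outer loop structure is what "many hundreds of automatic executions" amounts to around a small, quickly-trained predictor.

```python
import random

def fit(pairs, lr, reg, iters=50):
    """Fit a toy one-parameter ridge model r ~ b*x by stochastic gradient descent."""
    b = 0.0
    for _ in range(iters):
        for x, r in pairs:
            e = r - b * x
            b += lr * (e * x - reg * b)
    return b

def rmse(b, pairs):
    return (sum((r - b * x) ** 2 for x, r in pairs) / len(pairs)) ** 0.5

def tune(train, valid, trials=200, seed=0):
    """Random search over (lr, reg); keep the best held-out RMSE."""
    rng = random.Random(seed)
    best_score, best_cfg = float("inf"), None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, -1)   # sample both constants on a log scale
        reg = 10 ** rng.uniform(-4, 0)
        score = rmse(fit(train, lr, reg), valid)
        if score < best_score:
            best_score, best_cfg = score, (lr, reg)
    return best_score, best_cfg
```

A per-parameter version of this search (one lr/reg pair per parameter kind) is the analogue of what the tuner had to optimize for the baseline predictor.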
The parameters of the RMSE=0.9278 result of model (11) were learned with a 40-iteration stochastic gradient descent process, with the following constants governing the learning of each kind of parameter:

            b_u     b_{u,t}   α_u    b_i     b_{i,Bin(t)}   c_u     c_{u,t}   b_{i,f_ui}
reg × 10²   2.55    .231      395    2.55    9.29           4.76    1.90      1.10e-6
The log basis, a, was likewise determined by APT. Later, we refer to this model as [PQ1].