Abstract. A data cube is a popular organization for summary data. A cube is simply a multidimensional structure that contains in each cell an aggregate value, i.e., the result of applying an aggregate function to an underlying relation. In practical situat
266´ANDWUBARBARAWehavechosentoselectourmodelsfromamongthefollowingset:
Themodelwithoutanydimensionaleffects.Thatis,theonethatonlyincludestheγ. Thecompleteindependencemodelwithonlymain-effectsterms.Thatis,themodelthatincludestheγandγiA,γjB,.... Allthemodelswithallpossiblek-factoreffectsandnohigher-orderforms,where2≤k≤MAXLVL,withMAXLVLisapre-establishedvaluelessthanorequalton.
Allthemodelsweconsideraresaturated.Fromallthesemodelsweselectastartmodelwhichhasthebestcompressionratio:thatis,thefewestparametersandoutliers.Thisselectionisbyenumeration:wecomputetheparametersandtheoutliersforeachmodelandselectthebestamongthem.LetuscallthisstartmodelMs.
Wethentrytoimproveonthismodelbyusingaperturbationheuristic.Theheuristicconsistsofthreesteps,asfollows:
Backwardelimination:parethenumberofparametersandoutliersofthenewmodelwiththosefoundforMs,ifthenumberissmaller,continuedoingbackwardelimination(selectingthenextfactorwiththesmalleststandardvarianceandsuppressingit).Otherwisestop.Letusassumethebestmodelwe ndwithbackwardeliminationisMb.
Forwardselection:choosethefactorwiththebiggeststandardvarianceamongthefactorsnotpresentinMsandaddittoMs,toformanewmodel.Asbefore,comparethenumberofparametersandoutliersofthenewmodelwiththosefoundforMs,ifthenumberissmaller,continuedoingforwardselection.Otherwisestop.Letusassumethatthebestmodelwe ndwithforwardselectionisMf.
ComparetheperformanceofMbwiththatofMf,keepingthebestonebetweenthetwo.Accordingtoourexperience,althoughwemaynotgettheoptimalmodelfollowingthisheuristic,wedogetaverygoodapproximation.TheadvantageofthisheuristicisthatweonlyneedtoperformanumberofpassesoverthedatawearemodelingequaltoMAXLVLplusthenumberofforwardtries,plusthenumberofbackwardtries.Itisimportanttonoticethatwearemodelingthesubsetofdatacontainedinachunk,whichismuchsmallerthantheentirecorecuboidandusually tsinmemory.Noticethatifthisisthecaseforeverychunk,thenweonlyneedtobringeachcellinthecuboidoncetomemory.
2.3.Queryingtheapproximatecube
Arangequeryoverthecuboidcanbedecomposedastheunionofseveraldisjointqueries,eachspanningachunkinthecuboid.Thatbeingthecase,foreachoneofthedisjointsub-queriestherearetwopossibilities:
Thesub-querycompletelyincludesitsrespectivechunk.Inthiscase,theanswerofthesub-querycanbeimmediatelyobtainedfromthechunkdescription,whichcontainstheaggregatedvalueofallthecellsinthechunk.Noticethatthisvalueisfreeoferrorandthat