
Estimating the quality of data in relational databases(7)

Date: 2025-04-29   Source: unknown

a view map or a relation map, as appropriate. Now, the task is to partition this two-dimensional array into areas in which elements are distributed homogeneously with respect to our quality measures.

Note that the correctness of a particular nonkey attribute value can be determined only in reference to the key attribute of that tuple, i.e., in determining whether a specific cell should be 0 or 1 we consider the correctness of the pair (key value; nonkey value). The pair is correct if and only if both elements of the pair are correct. This means, in particular, that if a key attribute value is incorrect, then all pairs corresponding to this key attribute value are considered incorrect.
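The pair rule above can be illustrated with a minimal sketch. The function name `view_map` and the boolean inputs `key_ok` and `nonkey_ok` are hypothetical and serve only this illustration; in practice the correctness of each value would have to come from comparison against an ideal representation.

```python
from typing import List, Sequence


def view_map(key_ok: Sequence[bool],
             nonkey_ok: Sequence[Sequence[bool]]) -> List[List[int]]:
    """Build a 0/1 map with one row per tuple and one column per nonkey attribute.

    Cell (i, j) is 1 only if both the key value of tuple i and its j-th
    nonkey value are correct, following the (key value; nonkey value) rule.
    """
    return [
        [1 if k and v else 0 for v in row]
        for k, row in zip(key_ok, nonkey_ok)
    ]


# Hypothetical data: the key of the second tuple is incorrect, so its
# entire row becomes 0 even though its nonkey values are correct.
key_ok = [True, False, True]
nonkey_ok = [
    [True, True],
    [True, True],
    [False, True],
]
cells = view_map(key_ok, nonkey_ok)
print(cells)
```

The second row shows the consequence stated in the text: an incorrect key value makes every pair of that tuple incorrect.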

The technique we use for partitioning the view map is a nonparametric statistical method called CART (Classification and Regression Trees) [2]. This method has been widely used for data analysis in biology, social science, environmental research, and pattern recognition. Closer to our area, this method was used in [3] for estimating the selectivity of selection queries. We assume that tuples and attributes of a relation are ordered uniquely.

4.2 Homogeneity Measure

Intuitively, a view is perfectly homogeneous with respect to a given property if every subview of the view contains the same proportion of pairs with this property as the view itself. Moreover, the more homogeneous a view, the closer its distribution of the pairs with the given property is to the distribution in the perfectly homogeneous view. Hence, the difference between the proportion of the pairs with the given property in the view itself and in each of its subviews can be used to measure the degree of homogeneity of the given view.

Specifically, let $\bar{v}$ denote an extension of a view of a relation in a stored database, let $v_1, \ldots, v_N$ be the set of all possible projection-selection views of $\bar{v}$, and let $s(\bar{v})$ and $s(v_i)$ denote the proportion of pairs in views $\bar{v}$ and $v_i$ ($i = 1, \ldots, N$), respectively, that occur in their corresponding ideal representations (i.e., proportions of correct pairs in these views). Then

\[
\frac{1}{N} \sum_{v_i \subseteq \bar{v}} \bigl( s(\bar{v}) - s(v_i) \bigr)^2
\]

measures the homogeneity of the view $\bar{v}$ with respect to soundness. The homogeneity with respect to completeness is defined analogously. Similar measures of homogeneity were proposed in [6,3].
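To make the measure concrete, here is a brute-force sketch for a very small 0/1 map. It assumes, purely for illustration, that every nonempty subset of rows combined with every nonempty subset of columns counts as one projection-selection subview; the paper may restrict the set of views differently. The helper names are hypothetical.

```python
from itertools import combinations


def proportion(cells, rows, cols):
    """Proportion of 1-cells in the subview given by row and column indices."""
    total = len(rows) * len(cols)
    return sum(cells[r][c] for r in rows for c in cols) / total


def nonempty_subsets(n):
    """Yield every nonempty subset of range(n) as a tuple of indices."""
    for k in range(1, n + 1):
        yield from combinations(range(n), k)


def homogeneity(cells):
    """Mean squared difference between s(whole view) and s(subview).

    Assumption: subviews are all (row subset, column subset) combinations.
    A value of 0 means perfectly homogeneous; larger means less homogeneous.
    """
    n_rows, n_cols = len(cells), len(cells[0])
    s_whole = proportion(cells, range(n_rows), range(n_cols))
    sq_diffs = [
        (s_whole - proportion(cells, rows, cols)) ** 2
        for rows in nonempty_subsets(n_rows)
        for cols in nonempty_subsets(n_cols)
    ]
    return sum(sq_diffs) / len(sq_diffs)


print(homogeneity([[1, 1], [1, 1]]))  # all cells correct
print(homogeneity([[1, 0]]))          # one correct, one incorrect cell
```

The number of subviews under this assumption is $(2^r - 1)(2^c - 1)$ for a map with $r$ rows and $c$ columns, which is why the enumeration is only feasible for toy inputs.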

Due to the large number of possible views, computation of these measures is often prohibitively expensive. The Gini index [2,3] was proposed as a simple alternative to these homogeneity measures.

Consider a view $\bar{v}$ and a relation map $M$. We call the part of $M$ that corresponds to $\bar{v}$ a node³. The Gini index of this node, denoted $G(\bar{v})$, is $2p(1-p)$, where $p$ denotes the …

³ We use the terms node and view interchangeably.
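The formula $2p(1-p)$ can be evaluated with a short sketch. The definition of $p$ is cut off in this excerpt, so the code assumes that $p$ is the proportion of cells equal to 1 in the node. Because $2p(1-p)$ is symmetric in $p$ and $1-p$, taking the proportion of 0-cells instead would give the same value. The function name `gini_index` is hypothetical.

```python
def gini_index(cells):
    """Gini index 2p(1 - p) of a node given as a 0/1 array.

    Assumption: p is the proportion of cells equal to 1 (the excerpt
    truncates the definition of p).
    """
    flat = [c for row in cells for c in row]
    p = sum(flat) / len(flat)
    return 2 * p * (1 - p)


print(gini_index([[1, 1], [1, 1]]))  # uniform node
print(gini_index([[1, 0], [0, 1]]))  # evenly mixed node
```

Under this assumption the index is 0 for a node whose cells are all equal and reaches its maximum of 0.5 when the two values are evenly mixed. It needs only one pass over the cells, in contrast to enumerating all subviews.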
