Journal of Computer Research and Development (计算机研究与发展)
ISSN 1000-1239/CN 11-1777/TP
Study of Sampling Methods on Data Mining and Stream Mining
Hu Wenyu¹,² (胡文瑜), Sun Zhihui¹ (孙志挥), and Wu Yingjie¹,³ (吴英杰)
¹(School of Computer Science and Engineering, Southeast University, Nanjing 210096)
²(Department of Computer and Information Science, Fujian University of Technology, Fuzhou 350108)
³(College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108)
(huwenyu@)
Abstract  Sampling is an efficient and widely used approximation technique. By dramatically scaling down the data set to be processed, it allows many algorithms to be applied to huge data sets in data mining and stream mining. Based on a detailed review, a taxonomic framework of sampling algorithms built on the distinction between uniform sampling and biased sampling is presented, and representative sampling algorithms such as reservoir sampling, concise sampling, count sampling, chain-sampling, and DV sampling are analyzed, compared, and evaluated. Because uniform sampling has limitations in some applications (queries with relatively low selectivity, outlier detection in large multidimensional data sets, and clustering over data streams with a skewed Zipf distribution), the need for biased sampling methods in these scenarios is discussed in detail. In addition to listing successful applications of sampling techniques in data mining, statistics estimation, and stream mining to date, we survey the application and development of sampling techniques, especially traditional classic techniques such as progressive sampling, adaptive sampling, stratified sampling, and two-phase sampling. Finally, future challenges and directions for data stream sampling are discussed.
Keywords  data mining; uniform sampling; biased sampling; data stream; synopsis data structure
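摘要 (Abstract)  Sampling is a general and effective approximation technique. In data mining research, sampling methods can significantly reduce the size of the data set to be processed, so that many data mining algorithms can be applied to large-scale data sets and to data streams. Through a comparative study, analysis, and summary of representative sampling methods used in the data mining field, a classification framework for sampling algorithms is proposed. After pointing out the limitations of uniform sampling, the necessity of choosing biased sampling methods in certain application scenarios is explained, and the applied research on and development of sampling techniques in the data mining field are surveyed. Finally, the challenges and future directions of sampling methods for data stream mining are discussed.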
关键词 (Key words)  data mining; uniform sampling; biased sampling; data stream; synopsis data structure
Chinese Library Classification (CLC) number  TP311.13; TP391
Received: 2009-12-15; revised: 2010-05-24
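The abstract names reservoir sampling among the representative uniform sampling algorithms. As an illustration only, the following is a minimal Python sketch of the classic single-pass reservoir scheme (often called Algorithm R), which keeps a uniform sample of fixed size from a stream whose length is not known in advance. The function name `reservoir_sample` and its parameters are assumptions made for this sketch and are not taken from the paper.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of at most k items from a stream.

    Hypothetical helper for illustration: one pass, O(k) memory.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # draw a slot in [0, i] and overwrite only if it falls in the reservoir.
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 100)` would return 100 distinct values from the range while holding only 100 items in memory at any time. A biased variant would replace the uniform slot draw with a weight-dependent rule; that is not shown here.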