在数据挖掘中,异常检测(也称为异常性检测[1])是通过与大多数数据显著不同而引起怀疑的罕见项目、事件或观察结果的识别。[1]通常,异常项目会转化为某种问题,如银行欺诈、结构缺陷、医疗问题或文本错误。异常也被称为异常值、新奇、噪音、偏差和异常。[2]
特别是,在滥用异常检测和网络入侵检测的背景下,有吸引力的对象通常不是罕见的 对象,但出乎意料的会在活动中爆发。这种模式不符合异常值作为罕见对象的常见统计定义,并且许多异常值检测方法(特别是无监督方法)将无法在这种数据上使用,除非它已经被适当地聚集。相反,聚类分析算法可能能够检测由这些模式形成的微聚类。[3]
存在三大类异常检测技术。[4]假设数据集中的大多数实例是正常的,无监督异常检测技术通过寻找最不适合数据集其余部分的实例来检测未标记测试数据集中的异常。监督异常检测技术需要被标记为“正常”和“异常”的数据集,并且涉及到训练分类器(许多其他统计分类问题的关键区别是异常检测的固有不平衡性质)。半监督异常检测技术从给定的正常 训练数据集构建一个表示正常行为的模型,然后测试学习模型生成测试实例的可能性。
文献中已经提出了几种异常检测技术。[7]一些流行的技术有:
不同方法的性能在很大程度上取决于数据集和参数,当在许多数据集和参数之间进行比较时,方法相对于另一种方法没有什么系统优势。[28][29]
^Zimek, Arthur; Schubert, Erich (2017), "Outlier Detection", Encyclopedia of Database Systems, Springer New York, pp. 1–5, doi:10.1007/978-1-4899-7993-3_80719-1, ISBN 9781489979933.
^Hodge, V. J.; Austin, J. (2004). "A Survey of Outlier Detection Methodologies" (PDF). Artificial Intelligence Review. 22 (2): 85–126. CiteSeerX 10.1.1.318.4023. doi:10.1007/s10462-004-4304-y..
^Dokas, Paul; Ertoz, Levent; Kumar, Vipin; Lazarevic, Aleksandar; Srivastava, Jaideep; Tan, Pang-Ning (2002). "Data mining for network intrusion detection" (PDF). Proceedings NSF Workshop on Next Generation Data Mining..
^Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys. 41 (3): 1–58. doi:10.1145/1541880.1541882..
^Tomek, Ivan (1976). "An Experiment with the Edited Nearest-Neighbor Rule". IEEE Transactions on Systems, Man, and Cybernetics. 6 (6): 448–452. doi:10.1109/TSMC.1976.4309523..
^Smith, M. R.; Martinez, T. (2011). "Improving classification accuracy by identifying and removing instances that should be misclassified" (PDF). The 2011 International Joint Conference on Neural Networks. p. 2690. CiteSeerX 10.1.1.221.1371. doi:10.1109/IJCNN.2011.6033571. ISBN 978-1-4244-9635-8..
^Zimek, Arthur; Filzmoser, Peter (2018). "There and back again: Outlier detection between statistical reasoning and data mining algorithms". Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 8 (6): e1280. doi:10.1002/widm.1280. ISSN 1942-4787..
^Knorr, E. M.; Ng, R. T.; Tucakov, V. (2000). "Distance-based outliers: Algorithms and applications". The VLDB Journal the International Journal on Very Large Data Bases. 8 (3–4): 237–253. CiteSeerX 10.1.1.43.1842. doi:10.1007/s007780050006..
^Ramaswamy, S.; Rastogi, R.; Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. Proceedings of the 2000 ACM SIGMOD international conference on Management of data – SIGMOD '00. p. 427. doi:10.1145/342009.335437. ISBN 1-58113-217-4..
^Angiulli, F.; Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. 2431. p. 15. doi:10.1007/3-540-45681-3_2. ISBN 978-3-540-44037-6..
^Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (2000). LOF: Identifying Density-based Local Outliers (PDF). Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD. pp. 93–104. doi:10.1145/335191.335388. ISBN 1-58113-217-4..
^Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). Isolation Forest. 2008 Eighth IEEE International Conference on Data Mining (in English). pp. 413–422. doi:10.1109/ICDM.2008.17. ISBN 9780769535029.CS1 maint: Unrecognized language (link).
^Schubert, E.; Zimek, A.; Kriegel, H. -P. (2012). "Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection". Data Mining and Knowledge Discovery. 28: 190–237. doi:10.1007/s10618-012-0300-z..
^Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2009). Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science. 5476. p. 831. doi:10.1007/978-3-642-01307-2_86. ISBN 978-3-642-01306-5..
^Kriegel, H. P.; Kroger, P.; Schubert, E.; Zimek, A. (2012). Outlier Detection in Arbitrarily Oriented Subspaces. 2012 IEEE 12th International Conference on Data Mining. p. 379. doi:10.1109/ICDM.2012.21. ISBN 978-1-4673-4649-8..
^Fanaee-T, H.; Gama, J. (2016). "Tensor-based anomaly detection: An interdisciplinary survey". Knowledge-Based Systems. 98: 130–147. doi:10.1016/j.knosys.2016.01.027..
^Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining. 5 (5): 363–387. doi:10.1002/sam.11161..
^Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; Williamson, R. C. (2001). "Estimating the Support of a High-Dimensional Distribution". Neural Computation. 13 (7): 1443–71. CiteSeerX 10.1.1.4.4106. doi:10.1162/089976601750264965. PMID 11440593..
^Hawkins, Simon; He, Hongxing; Williams, Graham; Baxter, Rohan (2002). "Outlier Detection Using Replicator Neural Networks". Data Warehousing and Knowledge Discovery. Lecture Notes in Computer Science. 2454. pp. 170–180. CiteSeerX 10.1.1.12.3366. doi:10.1007/3-540-46145-0_17. ISBN 978-3-540-44123-6..
^He, Z.; Xu, X.; Deng, S. (2003). "Discovering cluster-based local outliers". Pattern Recognition Letters. 24 (9–10): 1641–1650. CiteSeerX 10.1.1.20.4242. doi:10.1016/S0167-8655(03)00003-5..
^Campello, R. J. G. B.; Moulavi, D.; Zimek, A.; Sander, J. (2015). "Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection". ACM Transactions on Knowledge Discovery from Data. 10 (1): 5:1–51. doi:10.1145/2733381..
^Lazarevic, A.; Kumar, V. (2005). Feature bagging for outlier detection. Proc. 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. pp. 157–166. CiteSeerX 10.1.1.399.425. doi:10.1145/1081870.1081891. ISBN 978-1-59593-135-1..
^Nguyen, H. V.; Ang, H. H.; Gopalkrishnan, V. (2010). Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces. Database Systems for Advanced Applications. Lecture Notes in Computer Science. 5981. p. 368. doi:10.1007/978-3-642-12026-8_29. ISBN 978-3-642-12025-1..
^Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2011). Interpreting and Unifying Outlier Scores. Proceedings of the 2011 SIAM International Conference on Data Mining. pp. 13–24. CiteSeerX 10.1.1.232.2719. doi:10.1137/1.9781611972818.2. ISBN 978-0-89871-992-5..
^Schubert, E.; Wojdanowski, R.; Zimek, A.; Kriegel, H. P. (2012). On Evaluation of Outlier Rankings and Outlier Scores. Proceedings of the 2012 SIAM International Conference on Data Mining. pp. 1047–1058. doi:10.1137/1.9781611972825.90. ISBN 978-1-61197-232-0..
^Zimek, A.; Campello, R. J. G. B.; Sander, J. R. (2014). "Ensembles for unsupervised outlier detection". ACM SIGKDD Explorations Newsletter. 15: 11–22. doi:10.1145/2594473.2594476..
^Zimek, A.; Campello, R. J. G. B.; Sander, J. R. (2014). Data perturbation for outlier detection ensembles. Proceedings of the 26th International Conference on Scientific and Statistical Database Management – SSDBM '14. p. 1. doi:10.1145/2618243.2618257. ISBN 978-1-4503-2722-0..
^Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. (2016). "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". Data Mining and Knowledge Discovery. 30 (4): 891. doi:10.1007/s10618-015-0444-8. ISSN 1384-5810..
^Anomaly detection benchmark data repository of the Ludwig-Maximilians-Universität München; Mirror at University of São Paulo..
^Denning, D. E. (1987). "An Intrusion-Detection Model" (PDF). IEEE Transactions on Software Engineering. SE-13 (2): 222–232. CiteSeerX 10.1.1.102.5127. doi:10.1109/TSE.1987.232894..
^Teng, H. S.; Chen, K.; Lu, S. C. (1990). Adaptive real-time anomaly detection using inductively generated sequential patterns (PDF). Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy. pp. 278–284. doi:10.1109/RISP.1990.63857. ISBN 978-0-8186-2060-7..
^Jones, Anita K.; Sielken, Robert S. (1999). "Computer System Intrusion Detection: A Survey". Technical Report, Department of Computer Science, University of Virginia, Charlottesville, VA. CiteSeerX 10.1.1.24.7802..
暂无