Anomaly detection

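In data mining, anomaly detection (also called outlier detection^[1]) is the identification of rare items, events or observations that raise suspicion by differing significantly from the majority of the data.^[1] Typically, the anomalous items translate to some kind of problem, such as bank fraud, a structural defect, a medical problem or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.^[2]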

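In particular, in the context of abuse detection and network intrusion detection, the interesting objects are often not rare objects but unexpected bursts of activity. This pattern does not fit the common statistical definition of an outlier as a rare object, and many outlier detection methods (unsupervised methods in particular) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.^[3]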
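The role of aggregation can be illustrated with a minimal sketch. It assumes that "aggregating appropriately" means counting events in fixed-width time windows and comparing each count with the median count. The function name `burst_windows`, the `factor` parameter and the sample timestamps are hypothetical and do not come from the cited work.

```python
from collections import Counter
import statistics

def burst_windows(timestamps, window, factor=3.0):
    """Return start times of windows whose event count exceeds factor * median count."""
    # Aggregate individual events into fixed-width time windows.
    counts = Counter(int(t // window) for t in timestamps)
    first, last = min(counts), max(counts)
    # Include empty windows so quiet periods count toward the baseline.
    per_window = [counts.get(w, 0) for w in range(first, last + 1)]
    baseline = statistics.median(per_window)
    return [(first + i) * window
            for i, c in enumerate(per_window)
            if c > factor * baseline]

# Hypothetical event times in seconds: two events per 10 s window, plus a burst.
steady = [w * 10 + offset for w in range(6) for offset in (1, 6)]
burst = [30 + 0.5 * i for i in range(12)]
flagged = burst_windows(steady + burst, window=10)
print(flagged)
```

No single event in the burst is unusual on its own. Only the per-window count stands out, which is why a detector applied to raw events would likely miss it.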

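There are three broad categories of anomaly detection techniques.^[4] Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set, under the assumption that the majority of the instances in the data set are normal, by looking for the instances that fit least well with the remainder of the data set. Supervised anomaly detection techniques require a data set whose instances have been labeled "normal" and "abnormal", and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of anomaly detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the learned model.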
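The semi-supervised setting can be sketched as follows. This is a minimal illustration, assuming the "model of normal behavior" is a univariate Gaussian fitted to the normal training data. The helper names and sample values are hypothetical, and real systems usually use richer models.

```python
import math
import statistics

def fit_normal_model(train):
    """Estimate mean and standard deviation from data assumed to be normal."""
    return statistics.fmean(train), statistics.stdev(train)

def log_likelihood(x, mean, std):
    """Log density of x under a univariate Gaussian with the fitted parameters."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))

# Hypothetical training set containing only normal observations.
normal_train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mean, std = fit_normal_model(normal_train)

# Test instances with a low log-likelihood are candidates for anomalies.
for x in [10.05, 14.0]:
    print(x, round(log_likelihood(x, mean, std), 2))
```

A test instance far from the training data receives a much lower log-likelihood than one close to it. Where to place the cut-off is left to the application.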

1 Applications

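Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances. It is often used in preprocessing to remove anomalous data from a data set. In supervised learning, removing the anomalous data from the data set often results in a statistically significant increase in accuracy.^[5]^[6]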
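As a rough illustration of the preprocessing use, the sketch below drops values outside the Tukey fences (1.5 times the interquartile range beyond the quartiles). This rule is substituted here for simplicity and is not the method of the cited studies. The function name and sample data are hypothetical.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical feature column with one implausible reading.
data = [10, 12, 11, 13, 12, 11, 95, 12]
cleaned = remove_outliers_iqr(data)
print(cleaned)
```

The cleaned list would then be passed to the learning step in place of the raw column.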

2 Popular techniques

Several anomaly detection techniques have been proposed in the literature.^[7] Some of the popular techniques are:

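Density-based techniques (k-nearest neighbor,^[8]^[9]^[10] local outlier factor,^[11] isolation forests,^[12] and many more variations of this concept^[13]).
Subspace-based,^[14] correlation-based^[15] and tensor-based^[16] outlier detection for high-dimensional data.^[17]
One-class support vector machines.^[18]
Replicator neural networks,^[19] autoencoders.
Bayesian networks.^[19]
Hidden Markov models (HMMs).^[19]
Cluster analysis-based outlier detection.^[20]^[21]
Deviations from association rules and frequent itemsets.
Fuzzy logic-based outlier detection.
Ensemble techniques, using feature bagging,^[22]^[23] score normalization^[24]^[25] and different sources of diversity.^[26]^[27]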
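To make the first entry in the list concrete, here is a minimal sketch of a k-nearest-neighbor outlier score: each point is scored by the Euclidean distance to its k-th nearest neighbor, and larger scores suggest more isolated points. It is a brute-force illustration with hypothetical sample points, not the optimized algorithms of the cited papers.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: a tight group of four points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])
```

The brute-force loop takes quadratic time in the number of points, so larger data sets would typically need an index structure.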

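The performance of the different methods depends heavily on the data set and the parameters, and no method has much systematic advantage over another when compared across many data sets and parameter settings.^[28]^[29]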

3 Application to data security

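Anomaly detection for intrusion detection systems (IDS) was proposed by Dorothy Denning in 1986.^[30] Anomaly detection for IDS is normally accomplished with thresholds and statistics, but it can also be done with soft computing and inductive learning.^[31] The types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs, based on frequencies, means, variances, covariances, and standard deviations.^[32] The counterpart of anomaly detection in intrusion detection is misuse detection.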
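The threshold-and-statistics approach can be sketched as a per-user profile made of a mean and a standard deviation, with an alert when a new observation deviates by more than a chosen number of standard deviations. The function names, the threshold of 3 and the login counts are assumptions for illustration and are not taken from the cited systems.

```python
import statistics

def build_profile(history):
    """Summarize a user's historical activity counts as (mean, standard deviation)."""
    return statistics.fmean(history), statistics.pstdev(history)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the profile mean."""
    mean, std = profile
    if std == 0:
        # A constant history: any different value is a deviation.
        return value != mean
    return abs(value - mean) / std > threshold

# Hypothetical daily login counts for one user.
logins_per_day = [4, 5, 6, 5, 4, 6, 5, 5]
profile = build_profile(logins_per_day)
print(is_anomalous(5, profile), is_anomalous(40, profile))
```

The same pattern extends to the other profile types mentioned above (workstations, remote hosts, programs) by keeping one profile per entity.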

4 Software

ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.

5 Datasets

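Anomaly detection benchmark data repository of the Ludwig-Maximilians-Universität München; mirror at the University of São Paulo.
ODDS (Outlier Detection DataSets): a large collection of publicly available outlier detection datasets with ground truth in different domains.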

References

[1] Zimek, Arthur; Schubert, Erich (2017), "Outlier Detection", Encyclopedia of Database Systems, Springer New York, pp. 1–5, doi:10.1007/978-1-4899-7993-3_80719-1, ISBN 9781489979933.
[2] Hodge, V. J.; Austin, J. (2004). "A Survey of Outlier Detection Methodologies" (PDF). Artificial Intelligence Review. 22 (2): 85–126. CiteSeerX 10.1.1.318.4023. doi:10.1007/s10462-004-4304-y.
[3] Dokas, Paul; Ertoz, Levent; Kumar, Vipin; Lazarevic, Aleksandar; Srivastava, Jaideep; Tan, Pang-Ning (2002). "Data mining for network intrusion detection" (PDF). Proceedings NSF Workshop on Next Generation Data Mining.
[4] Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys. 41 (3): 1–58. doi:10.1145/1541880.1541882.
[5] Tomek, Ivan (1976). "An Experiment with the Edited Nearest-Neighbor Rule". IEEE Transactions on Systems, Man, and Cybernetics. 6 (6): 448–452. doi:10.1109/TSMC.1976.4309523.
[6] Smith, M. R.; Martinez, T. (2011). "Improving classification accuracy by identifying and removing instances that should be misclassified" (PDF). The 2011 International Joint Conference on Neural Networks. p. 2690. CiteSeerX 10.1.1.221.1371. doi:10.1109/IJCNN.2011.6033571. ISBN 978-1-4244-9635-8.
[7] Zimek, Arthur; Filzmoser, Peter (2018). "There and back again: Outlier detection between statistical reasoning and data mining algorithms". Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 8 (6): e1280. doi:10.1002/widm.1280. ISSN 1942-4787.
[8] Knorr, E. M.; Ng, R. T.; Tucakov, V. (2000). "Distance-based outliers: Algorithms and applications". The VLDB Journal the International Journal on Very Large Data Bases. 8 (3–4): 237–253. CiteSeerX 10.1.1.43.1842. doi:10.1007/s007780050006.
[9] Ramaswamy, S.; Rastogi, R.; Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. Proceedings of the 2000 ACM SIGMOD international conference on Management of data – SIGMOD '00. p. 427. doi:10.1145/342009.335437. ISBN 1-58113-217-4.
[10] Angiulli, F.; Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. 2431. p. 15. doi:10.1007/3-540-45681-3_2. ISBN 978-3-540-44037-6.
[11] Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (2000). LOF: Identifying Density-based Local Outliers (PDF). Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD. pp. 93–104. doi:10.1145/335191.335388. ISBN 1-58113-217-4.
[12] Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). Isolation Forest. 2008 Eighth IEEE International Conference on Data Mining. pp. 413–422. doi:10.1109/ICDM.2008.17. ISBN 9780769535029.
[13] Schubert, E.; Zimek, A.; Kriegel, H. -P. (2012). "Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection". Data Mining and Knowledge Discovery. 28: 190–237. doi:10.1007/s10618-012-0300-z.
[14] Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2009). Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science. 5476. p. 831. doi:10.1007/978-3-642-01307-2_86. ISBN 978-3-642-01306-5.
[15] Kriegel, H. P.; Kroger, P.; Schubert, E.; Zimek, A. (2012). Outlier Detection in Arbitrarily Oriented Subspaces. 2012 IEEE 12th International Conference on Data Mining. p. 379. doi:10.1109/ICDM.2012.21. ISBN 978-1-4673-4649-8.
[16] Fanaee-T, H.; Gama, J. (2016). "Tensor-based anomaly detection: An interdisciplinary survey". Knowledge-Based Systems. 98: 130–147. doi:10.1016/j.knosys.2016.01.027.
[17] Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining. 5 (5): 363–387. doi:10.1002/sam.11161.
[18] Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; Williamson, R. C. (2001). "Estimating the Support of a High-Dimensional Distribution". Neural Computation. 13 (7): 1443–71. CiteSeerX 10.1.1.4.4106. doi:10.1162/089976601750264965. PMID 11440593.
[19] Hawkins, Simon; He, Hongxing; Williams, Graham; Baxter, Rohan (2002). "Outlier Detection Using Replicator Neural Networks". Data Warehousing and Knowledge Discovery. Lecture Notes in Computer Science. 2454. pp. 170–180. CiteSeerX 10.1.1.12.3366. doi:10.1007/3-540-46145-0_17. ISBN 978-3-540-44123-6.
[20] He, Z.; Xu, X.; Deng, S. (2003). "Discovering cluster-based local outliers". Pattern Recognition Letters. 24 (9–10): 1641–1650. CiteSeerX 10.1.1.20.4242. doi:10.1016/S0167-8655(03)00003-5.
[21] Campello, R. J. G. B.; Moulavi, D.; Zimek, A.; Sander, J. (2015). "Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection". ACM Transactions on Knowledge Discovery from Data. 10 (1): 5:1–51. doi:10.1145/2733381.
[22] Lazarevic, A.; Kumar, V. (2005). Feature bagging for outlier detection. Proc. 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. pp. 157–166. CiteSeerX 10.1.1.399.425. doi:10.1145/1081870.1081891. ISBN 978-1-59593-135-1.
[23] Nguyen, H. V.; Ang, H. H.; Gopalkrishnan, V. (2010). Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces. Database Systems for Advanced Applications. Lecture Notes in Computer Science. 5981. p. 368. doi:10.1007/978-3-642-12026-8_29. ISBN 978-3-642-12025-1.
[24] Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2011). Interpreting and Unifying Outlier Scores. Proceedings of the 2011 SIAM International Conference on Data Mining. pp. 13–24. CiteSeerX 10.1.1.232.2719. doi:10.1137/1.9781611972818.2. ISBN 978-0-89871-992-5.
[25] Schubert, E.; Wojdanowski, R.; Zimek, A.; Kriegel, H. P. (2012). On Evaluation of Outlier Rankings and Outlier Scores. Proceedings of the 2012 SIAM International Conference on Data Mining. pp. 1047–1058. doi:10.1137/1.9781611972825.90. ISBN 978-1-61197-232-0.
[26] Zimek, A.; Campello, R. J. G. B.; Sander, J. R. (2014). "Ensembles for unsupervised outlier detection". ACM SIGKDD Explorations Newsletter. 15: 11–22. doi:10.1145/2594473.2594476.
[27] Zimek, A.; Campello, R. J. G. B.; Sander, J. R. (2014). Data perturbation for outlier detection ensembles. Proceedings of the 26th International Conference on Scientific and Statistical Database Management – SSDBM '14. p. 1. doi:10.1145/2618243.2618257. ISBN 978-1-4503-2722-0.
[28] Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. (2016). "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". Data Mining and Knowledge Discovery. 30 (4): 891. doi:10.1007/s10618-015-0444-8. ISSN 1384-5810.
[29] Anomaly detection benchmark data repository of the Ludwig-Maximilians-Universität München; Mirror at University of São Paulo.
[30] Denning, D. E. (1987). "An Intrusion-Detection Model" (PDF). IEEE Transactions on Software Engineering. SE-13 (2): 222–232. CiteSeerX 10.1.1.102.5127. doi:10.1109/TSE.1987.232894.
[31] Teng, H. S.; Chen, K.; Lu, S. C. (1990). Adaptive real-time anomaly detection using inductively generated sequential patterns (PDF). Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy. pp. 278–284. doi:10.1109/RISP.1990.63857. ISBN 978-0-8186-2060-7.
[32] Jones, Anita K.; Sielken, Robert S. (1999). "Computer System Intrusion Detection: A Survey". Technical Report, Department of Computer Science, University of Virginia, Charlottesville, VA. CiteSeerX 10.1.1.24.7802.
