Automated machine learning

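Automated machine learning (AutoML) is the process of automating, end to end, the application of machine learning to real-world problems. In a typical machine learning application, practitioners have a dataset of input data points to train on. The raw data may not be in a form that every algorithm can use out of the box. An expert may have to apply appropriate data preprocessing, feature engineering, feature extraction, and feature selection methods to make the dataset amenable to machine learning. After these preprocessing steps, practitioners must select an algorithm and optimize its hyperparameters to maximize the predictive performance of the final model. Because many of these steps are beyond the abilities of non-experts, AutoML was proposed as an artificial-intelligence-based solution to the ever-growing challenge of applying machine learning^[1]^[2]. Automating the end-to-end process offers several advantages: simpler solutions, faster creation of those solutions, and models that often outperform hand-designed ones. However, AutoML is not a silver bullet: it can introduce additional parameters of its own, in effect hyperparameters of the AutoML system itself, which may require some expertise to set. It does nonetheless make machine learning easier for non-experts to apply.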

1 Targets of automation

Automated machine learning can target different stages of the machine learning process:^[2]

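Automated data preparation and ingestion (from raw data and miscellaneous formats)
- Automated column type detection; for example boolean, discrete numerical, continuous numerical, or text
- Automated column intent detection; for example target/label, stratification field, numerical feature, categorical text feature, or free-text feature
- Automated task detection; for example binary classification, regression, clustering, or ranking
Automated feature engineering
- Feature selection
- Feature extraction
- Meta-learning and transfer learning
- Detection and handling of skewed data and/or missing values
Automated model selection
Hyperparameter optimization of the learning algorithm and featurization
Automated pipeline selection under time, memory, and complexity constraints
Automated selection of evaluation metrics and validation procedures
Automated problem checking
- Leakage detection
- Misconfiguration detection
Automated analysis of the results obtained
User interfaces and visualizations for automated machine learning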
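
As an illustration of the column type and task detection steps listed above, the following is a minimal sketch using simple parsing heuristics. The function names (`infer_column_type`, `guess_task`), the returned labels, and the sample columns are hypothetical and not taken from any particular AutoML system; real systems use considerably more robust rules. In particular, this sketch does not detect ranking tasks and would label an integer-valued regression target as multiclass classification.

```python
def _parses_as(cast, text):
    """Return True when cast(text) succeeds, e.g. cast=int or cast=float."""
    try:
        cast(text)
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Guess a column type from raw string cells (illustrative heuristic)."""
    # Ignore missing and blank cells before inspecting the rest.
    cells = [v.strip() for v in values if v is not None and v.strip()]
    if not cells:
        return "unknown"
    if {c.lower() for c in cells} <= {"true", "false", "yes", "no", "0", "1"}:
        return "boolean"
    if all(_parses_as(int, c) for c in cells):
        return "discrete numeric"
    if all(_parses_as(float, c) for c in cells):
        return "continuous numeric"
    return "text"


def guess_task(target_values=None):
    """Guess the learning task from the target column, if there is one."""
    if target_values is None:
        return "clustering"  # no label column, so treat the problem as unsupervised
    distinct = {v.strip().lower() for v in target_values if v is not None and v.strip()}
    if len(distinct) == 2:
        return "binary classification"
    if infer_column_type(target_values) == "continuous numeric":
        return "regression"
    return "multiclass classification"


# Hypothetical raw columns, as they might arrive from a CSV file.
columns = {
    "churned": ["yes", "no", "yes"],
    "visits": ["3", "7", "12"],
    "spend": ["1.5", "2", "3.25"],
    "city": ["Oslo", "Lima", "Pune"],
}
detected = {name: infer_column_type(cells) for name, cells in columns.items()}
```

Here `detected` maps each column name to its guessed type, and `guess_task(columns["churned"])` treats the two-valued target as a binary classification problem.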

2 Examples

Notable platforms that address various stages of AutoML:

2.1 Hyperparameter optimization and model selection

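Auto-WEKA^[3] is a Bayesian hyperparameter optimization layer on top of WEKA.
auto-sklearn^[4] is a Bayesian hyperparameter optimization layer on top of scikit-learn.
ATM^[5] is an open-source software library from the Human-Data Interaction project at MIT. It is a distributed, scalable AutoML system designed with ease of use in mind.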
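
The tools above search jointly over which algorithm to use and how to set its hyperparameters. The following minimal sketch shows that joint search on a toy one-dimensional dataset. It deliberately swaps in plain random search for the Bayesian optimization that Auto-WEKA and auto-sklearn use, and it does not call any of those libraries; the two toy models, the `SEARCH_SPACE` table, and all function names are assumptions made for illustration.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def threshold_predict(train, x, cut):
    """Predict class 1 when x exceeds a fixed cut point (ignores training data)."""
    return 1 if x > cut else 0


# Each candidate algorithm comes with a sampler for its single hyperparameter.
SEARCH_SPACE = {
    "knn": (knn_predict, lambda rng: rng.choice([1, 3, 5, 7])),
    "threshold": (threshold_predict, lambda rng: rng.uniform(0.0, 1.0)),
}


def accuracy(predict, param, train, valid):
    """Fraction of held-out points that the configured model labels correctly."""
    hits = sum(predict(train, x, param) == y for x, y in valid)
    return hits / len(valid)


def random_search(train, valid, n_trials=30, seed=0):
    """Sample (algorithm, hyperparameter) pairs and keep the best-scoring one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        name = rng.choice(sorted(SEARCH_SPACE))
        predict, sample = SEARCH_SPACE[name]
        param = sample(rng)
        score = accuracy(predict, param, train, valid)
        if best is None or score > best[0]:
            best = (score, name, param)
    return best


# Toy data: the label is 1 exactly when the feature exceeds 0.6.
data_rng = random.Random(42)
points = [(x, int(x > 0.6)) for x in (data_rng.random() for _ in range(200))]
train, valid = points[:150], points[150:]
best_score, best_model, best_param = random_search(train, valid)
```

A Bayesian optimizer would replace the uniform sampling in `random_search` with a model that proposes promising configurations based on the scores seen so far; the surrounding evaluate-and-keep-the-best loop stays the same.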

2.2 Full pipeline optimization

TPOT^[6]^[7] is a Python library that uses genetic programming to automatically create and optimize complete machine learning pipelines.
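H2O Driverless AI[8] is an AutoML platform developed by H2O.ai that automates visualization, feature engineering, model training, hyperparameter optimization, and explainability.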
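TransmogrifAI^[8]^[9] is a Scala/SparkML library created by Salesforce for automated data cleansing, feature engineering, model selection, and hyperparameter optimization.
RECIPE^[10] is a framework built on grammar-based genetic programming that constructs customized scikit-learn classification pipelines.
GA-Auto-MLC and Auto-MEKA are freely available methods for automated multi-label classification on top of the MEKA software.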
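
To make the idea of searching over whole pipelines concrete, the sketch below evolves a three-stage pipeline (scaling, feature subset, model) on toy data. It is a simplified, mutation-only evolutionary search over fixed-length configurations, not the tree-based genetic programming used by TPOT or the grammar-based approach of RECIPE, and it uses none of their APIs. The stage options, function names, and dataset are all hypothetical.

```python
import random

STAGE_NAMES = ["scaler", "features", "model"]
STAGES = {
    "scaler": ["none", "minmax"],
    "features": ["all", "first", "second"],
    "model": ["centroid", "1nn"],
}


def transform(rows, genome, lo, hi):
    """Apply the scaling and feature-subset stages chosen by the genome."""
    scaler, features, _ = genome
    out = []
    for row in rows:
        if scaler == "minmax":
            row = [(v - l) / ((h - l) or 1.0) for v, l, h in zip(row, lo, hi)]
        if features == "first":
            row = row[:1]
        elif features == "second":
            row = row[1:]
        out.append(row)
    return out


def nearest_label(reference, x):
    """Label of the reference point closest to x (squared Euclidean distance)."""
    return min(reference, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))[1]


def fitness(genome, train, valid):
    """Validation accuracy of the pipeline described by the genome."""
    (train_x, train_y), (valid_x, valid_y) = train, valid
    lo = [min(col) for col in zip(*train_x)]
    hi = [max(col) for col in zip(*train_x)]
    tx, vx = transform(train_x, genome, lo, hi), transform(valid_x, genome, lo, hi)
    if genome[2] == "1nn":
        reference = list(zip(tx, train_y))  # every training point is a reference
    else:
        reference = []  # one mean point per class (nearest centroid)
        for label in sorted(set(train_y)):
            members = [x for x, y in zip(tx, train_y) if y == label]
            reference.append(([sum(col) / len(members) for col in zip(*members)], label))
    hits = sum(nearest_label(reference, x) == y for x, y in zip(vx, valid_y))
    return hits / len(valid_y)


def mutate(genome, rng):
    """Resample one randomly chosen stage of the pipeline."""
    genes = list(genome)
    i = rng.randrange(len(genes))
    genes[i] = rng.choice(STAGES[STAGE_NAMES[i]])
    return tuple(genes)


def evolve(train, valid, pop_size=6, generations=5, seed=0):
    """Keep the better half of the population and refill it with mutants."""
    rng = random.Random(seed)
    population = [tuple(rng.choice(STAGES[n]) for n in STAGE_NAMES) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: fitness(g, train, valid), reverse=True)
        survivors = ranked[: pop_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = max(population, key=lambda g: fitness(g, train, valid))
    return best, fitness(best, train, valid)


# Toy data: only the first feature carries signal; the second is large-scale noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.uniform(0, 1000)] for _ in range(120)]
labels = [int(r[0] > 0.5) for r in rows]
train, valid = (rows[:80], labels[:80]), (rows[80:], labels[80:])
best, score = evolve(train, valid)
```

Because the noisy second feature dominates unscaled distances, pipelines that rescale the data or drop that feature tend to score higher, which is the kind of interaction between stages that full-pipeline search is meant to discover.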

2.3 Deep neural network architecture search

Auto-Keras^[11] is an open-source Python package for neural architecture search.

References

[1]
Thornton C, Hutter F, Hoos HH, Leyton-Brown K (2013). "Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms". KDD '13 Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 847–855.
[2]
Hutter F, Caruana R, Bardenet R, Bilenko M, Guyon I, Kegl B, Larochelle H. "AutoML 2014 @ ICML". AutoML 2014 Workshop @ ICML. Retrieved 2018-03-28.
[3]
Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K (2017). "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA". Journal of Machine Learning Research. 18 (25): 1–5.
[4]
Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F (2015). "Efficient and Robust Automated Machine Learning". Advances in Neural Information Processing Systems 28 (NIPS 2015): 2962–2970.
[5]
Swearingen, Thomas; Drevo, Will; Cyphers, Bennett; Cuesta-Infante, Alfredo; Ross, Arun; Veeramachaneni, Kalyan (December 2017). "ATM: A distributed, collaborative, scalable system for automated machine learning". 2017 IEEE International Conference on Big Data (Big Data). IEEE. doi:10.1109/bigdata.2017.8257923. ISBN 9781538627150.
[6]
Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Kidd L, Moore JH (2016). "Automating biomedical data science through tree-based pipeline optimization". Proceedings of EvoStar 2016. Lecture Notes in Computer Science. 9597. pp. 123–137. arXiv:1601.07925. doi:10.1007/978-3-319-31204-0_9. ISBN 978-3-319-31203-3.
[7]
Olson RS, Bartley N, Urbanowicz RJ, Moore JH (2016). "Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science". Proceedings of EvoBIO 2016. GECCO '16. pp. 485–492. arXiv:1603.06212. doi:10.1145/2908812.2908918. ISBN 9781450342063.
[8]
Shubha Nabar (2018-08-16). "Open Sourcing TransmogrifAI – Automated Machine Learning for Structured Data - Salesforce Engineering". Salesforce Engineering. Retrieved 2018-08-16.
[9]
Kyle Wiggers (2018-08-16). "Salesforce open-sources TransmogrifAI, the machine learning library that powers Einstein". VentureBeat. Retrieved 2018-08-16. "Once TransmogrifAI has extracted features from the dataset, it's primed to begin automated model training. At this stage, it runs a cadre of machine learning algorithms in parallel on the data, automatically selects the best-performing model, and samples and recalibrates predictions to avoid imbalanced data."
[10]
de Sá, Alex G. C.; Pinto, Walter José G. S.; Oliveira, Luiz Otavio V. B.; Pappa, Gisele L. (2017). "RECIPE: A Grammar-Based Framework for Automatically Evolving Classification Pipelines". Lecture Notes in Computer Science. Springer International Publishing. pp. 246–261. doi:10.1007/978-3-319-55696-3_16. ISBN 9783319556956.
[11]
Haifeng J, Qingquan S, Xia H (2018). "Auto-Keras: Efficient Neural Architecture Search with Network Morphism". arXiv:1806.10282 [cs.LG].
