
by Daitan

What We Learned by Serving Machine Learning Models at Scale Using Amazon SageMaker

By Bruno Schionato, Diego Domingos, Fernando Moraes, Gustavo Rozato, Isac Souza, Marciano Nardi, Thalles Silva

Last time, we talked about how to . Following our plans, we go a step further and investigate more complete solutions. In this post, we turn our attention to Amazon SageMaker.

SageMaker is a platform for developing and deploying ML models. It promises to ease the process of training and deploying models to production at scale.

To accomplish this goal, it offers services that aim to solve the various stages of the data science pipeline such as:

  • Data collection and storage
  • Data cleaning and preparation
  • Training and tuning ML models
  • Deploying to the cloud at scale

With that in mind, SageMaker positions itself as a fully managed ML service.

The typical workflow for creating ML models involves many steps. In this context, SageMaker aims to abstract away the process of solving each one of these stages. In fact, as we will see, by using SageMaker's built-in algorithms, we can deploy our models with a single line of code.

The entire process of training, evaluating, and deploying is done using Jupyter notebooks. Jupyter notebooks bring many advantages: they give freedom to experienced data scientists who are already accustomed to the tool, and they offer flexibility to those who do not have much experience in the area.

In summary, SageMaker provides many benefits for anyone who would like to train and deploy ML models to production. However, the price can be an issue.

Generally, the price depends on how and where you use Amazon's infrastructure. For obvious reasons, normal machine instances cost less than GPU-capable instances. Note that different regions have different prices. Also, Amazon groups the machines for different tasks: building, training, and deploying. You can find the full .

For training, SageMaker offers many of the most popular built-in ML algorithms. Some of them include K-Means, PCA, Sequence models, Linear Learners and XGBoost. Plus, Amazon promises outstanding performance on these implementations.

Moreover, if you want to train a model using a third-party library like Keras, SageMaker also has you covered. Indeed, it supports the most popular ML frameworks. Some of them include:

Check out these examples using and .

SageMaker — a brief Overview

To understand how SageMaker works, take a look at the following diagram. Let's say you want to train a simple Deep Convolution Neural Network (CNN) using Tensorflow.

The first box, "Model Files", represents the CNN's definition files. This is your model's architecture. Convolutions, pooling, and dense layers, for instance, go there. Note that, here, it is all developed using the framework of choice — Tensorflow in this case.

Second, we proceed by training the model using that framework. To do that, Amazon launches ML compute instances and uses the training code and dataset to carry out the training process. Then, it saves the final model artifacts and other output in a specified S3 bucket. Note that we can take advantage of parallel training, either via instance parallelism or by using GPU-capable machines.

Using the model's artifacts and a simple protocol, it creates a SageMaker model. Finally, this model can be deployed to an endpoint with options regarding the number and type of instances at which to deploy the model.

SageMaker also has a very interesting mechanism for tuning ML models — Automatic Model Tuning. Usually, tuning ML models is a very time-consuming and computationally expensive task, because the available techniques rely on brute-force methods like grid search or random search.

To give an example, using Automatic Model Tuning, we can select a subset of possible optimizers, say Adam and/or SGD, and a few values for the learning rate. Then, the engine will take care of the possible combinations and focus on the set of parameters that yields the best results.

Also, this process scales. We can choose the number of jobs to run in parallel along with the maximum number of jobs to run. After that, Auto Tuning will do the work. This feature works with both third-party libraries and built-in algorithms. Note that Amazon provides Automatic Model Tuning at no extra charge.
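
For orientation, here is a minimal sketch of what such a tuning job looks like with the SageMaker Python SDK; the estimator variable, objective metric name, parameter ranges, and S3 channel paths are assumptions for illustration:

```python
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

# Hypothetical search space: two optimizers and a learning-rate range.
hyperparameter_ranges = {
    "optimizer": CategoricalParameter(["adam", "sgd"]),
    "learning_rate": ContinuousParameter(0.0001, 0.1),
}

tuner = HyperparameterTuner(
    estimator=estimator,  # a SageMaker Estimator defined elsewhere
    objective_metric_name="validation:accuracy",  # assumed metric name
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,          # total training jobs in the search
    max_parallel_jobs=4,  # jobs allowed to run at the same time
)

# Launch the tuning job; the channels point to data already in S3.
tuner.fit({"train": s3_train_path, "validation": s3_validation_path})
```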

How about using SageMaker's deployment capabilities to serve a pre-trained model? That is right, you can either train a new model using the Amazon Cloud or use it to serve a pre-existing model. In other words, you can take advantage of the serving part of SageMaker to deploy models that were trained outside it.

Training and Deploying on SageMaker

As we know, SageMaker offers a variety of popular ML estimators. It also allows the possibility to take a pre-trained model and deploy it. However, based on our experiments, it is much easier to use its built-in implementations. The reason is that to deploy third-party models using SageMaker's APIs, one needs to deal with managing containers.

Thus, here we pose the challenge of dealing with the complete ML pipeline using SageMaker. We will use it for everything from the most basic to the more advanced ML tasks. Some of these tasks involve:

  • Uploading the dataset to an S3 bucket
  • Pre-processing the dataset for training
  • Training and deploying the model

Everything is done in the cloud.
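
As a concrete starting point, here is a minimal sketch of the upload step using boto3; the bucket and key names are hypothetical:

```python
import boto3

# Hypothetical bucket and prefix -- replace with your own.
bucket = "my-sagemaker-experiments"
prefix = "kdd99"

s3 = boto3.client("s3")

# Upload the local training file so SageMaker jobs can read it from S3.
s3.upload_file("kddcup99_train.csv", bucket, f"{prefix}/train/kddcup99_train.csv")

s3_train_path = f"s3://{bucket}/{prefix}/train/kddcup99_train.csv"
```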

Like in the previous post, we are going to fit a linear model using the KDD99 intrusion dataset. You can find more details about the dataset and pre-processing steps in .
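
The preprocessing itself is described in that post, but as a rough sketch of the kind of preparation involved (assuming the KDD99 data is loaded into a pandas DataFrame with its usual column names):

```python
import pandas as pd

# Hypothetical local copy of KDD99 with named columns.
df = pd.read_csv("kddcup99.csv")

# Binary target: 'normal.' traffic vs. any kind of intrusion.
df["label"] = (df["label"] != "normal.").astype(int)

# One-hot encode the three categorical features.
df = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])
```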

The whole process of training and deploying the model is done through SageMaker's Jupyter notebook interface. It does not need any configuration, and the notebook runs on an EC2 instance of your choice. Here, we chose an ml.m4.xlarge EC2 instance for hosting the notebook. We had problems loading the KDD99 dataset using a less powerful instance (due to lack of space).

Take a look at the EC2 machines' configurations:

To fit linear models, SageMaker has the Linear Learner algorithm. It provides a solution for both classification and regression. With very few lines, we can define and fit the model on the dataset.

Take a look at the Estimator class. It is a base class that encapsulates all the different built-in algorithms from SageMaker. Among other parameters, some of the most important ones include:

  • image_name: The container image to use for training.
  • train_instance_count: Number of EC2 instances used for training.
  • train_instance_type: The type of EC2 instance to use for training.
  • output_path: S3 location for saving the training result.

To define which kind of model we want to use, we set the 'image_name' parameter to 'linear-learner'. To execute the training procedure, we picked a ml.c4.xlarge EC2 instance. It has 4 virtual CPUs and 7.5 GB of RAM.

The model's hyper-parameters include:

  • feature_dim: the input dimension
  • predictor_type: whether classification or regression
  • mini_batch_size: how many samples to use per step

Finally, SageMaker provides a training API very much like scikit-learn's. Just call the fit() function, and you are in business.
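
Putting the pieces together, the define-and-fit step might look like the sketch below, written against the v1-style SageMaker Python SDK whose parameter names the text uses; the role, the bucket and prefix (reused from the upload sketch), and the exact feature count are assumptions:

```python
import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.session import s3_input

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role of the notebook instance

# Resolve the built-in Linear Learner container for the current region.
container = get_image_uri(boto3.Session().region_name, "linear-learner")

linear = sagemaker.estimator.Estimator(
    image_name=container,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.c4.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",  # hypothetical S3 location
    sagemaker_session=session,
)

linear.set_hyperparameters(
    feature_dim=41,  # KDD99 has 41 raw features; adjust to your preprocessed width
    predictor_type="binary_classifier",
    mini_batch_size=200,
)

# One call kicks off the managed training job.
linear.fit({"train": s3_input(s3_train_path, content_type="text/csv")})
```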

Now comes the final part — deployment. To do it, much like when training, we just run one line of code.

This routine will take care of deploying the trained model to an Amazon endpoint. Note that we need to specify the type of instance we want, in this case, a ml.m4.xlarge EC2 instance. Also, we can define a minimum number of EC2 instances to deploy our model. To do that, we just set the initial_instance_count parameter to a value greater than 1.
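
With the estimator above, deployment really is one line; the serializer setup and the sample_vector sanity check below are illustrative additions:

```python
from sagemaker.predictor import csv_serializer, json_deserializer

# Deploy the trained model behind a managed HTTPS endpoint.
predictor = linear.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
)

# Tell the predictor how to encode requests and decode responses.
predictor.content_type = "text/csv"
predictor.serializer = csv_serializer
predictor.deserializer = json_deserializer

# Hypothetical sanity check: score a single preprocessed feature vector.
result = predictor.predict(sample_vector)
```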

Auto Scaling

We have two main goals with the tests.

  • To evaluate the complete ML pipeline offered by SageMaker
  • To assess training and deployment scalability

In all tests, we used the SageMaker Auto Scaling tool. As we will see, it helps to control the traffic/instances trade-off.

As stated on the AWS website:

AWS Auto Scaling monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.

In short, SageMaker Auto Scaling makes it easier to build scaling plans for various resources across many services. These services include Amazon EC2, Spot Fleets, Amazon ECS tasks, and more. The idea is to adjust the number of running instances in response to changes in the workload.

It is important to note that Auto Scaling might fail in some situations. More specifically, when your application suffers spikes in traffic, Auto Scaling may not help at all. We know that for new (EC2) instances, Amazon needs some time to set up and configure the machine before it is able to process requests. Based on our experiments, this setup time might take from 5 to 7 minutes. If your application has short spikes (let's say 2 to 4 minutes) in the number of incoming requests, by the time the EC2 instance setup finishes, the need for more computing power might be over.

To address this situation, Amazon implements a simple policy for scaling new instances. Basically, after a scaling decision takes place, a cooldown period has to elapse before another scaling activity occurs. In other words, each action to issue a new instance is separated by a fixed (configurable) amount of time. This mechanism aims to ease the overhead of launching new machines.
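
For reference, a scaling plan like the ones used in our tests can be attached to a SageMaker endpoint through the Application Auto Scaling API; the endpoint name, target value, and cooldown lengths below are hypothetical:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint name; "AllTraffic" is the default variant name.
resource_id = "endpoint/linear-learner-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target (1 to 10 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Target tracking: add instances when invocations per instance get too high.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 30000.0,  # invocations/minute per instance (assumed)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 300,  # seconds to wait after adding an instance
        "ScaleInCooldown": 300,   # seconds to wait after removing one
    },
)
```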

Also, if your application has well-defined, predictable user traffic, Auto Scaling might be a bad choice. Suppose you host an application's website and you know that, at a specific time, it will be opened by hundreds of millions of users. In this situation, the time required for Auto Scaling to set up properly may result in a poor user experience.

Results

We used Taurus and JMeter to run load tests on our ML model developed with Amazon SageMaker.

The first scenario is defined as follows:

  • Number of concurrent users: 1000
  • Ramp-up time of 10 minutes
  • Hold-for period of 10 minutes

Simply put, the test consists of issuing requests from 1000 parallel users. In the first part of the test (the first 10 minutes), the number of users is scaled from 0 to 1000 (ramp-up). After that, the 1000 users continue to send parallel requests for 10 more minutes (hold-for period). Note that each user sends requests in a serial manner; that is, to issue a new request, a user has to wait until the current one finishes.
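
For reference, the scenario above maps onto a Taurus configuration along these lines (shown in Taurus's native YAML; the target URL is hypothetical):

```yaml
execution:
  - concurrency: 1000   # parallel virtual users
    ramp-up: 10m        # grow from 0 to 1000 users over 10 minutes
    hold-for: 10m       # keep all 1000 users running for 10 more minutes
    scenario: predict

scenarios:
  predict:
    requests:
      - url: https://example.execute-api.us-east-1.amazonaws.com/prod/predict  # hypothetical
        method: POST
```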

For the first tests, we decided to use a single machine. As a result, we did not define any scaling plan that would spawn new instances upon reaching some criterion.

In the graph below, the blue line (increasing in a staircase shape) is the number of parallel users. The orange line represents the average response time, and the green line the number of requests.

In the beginning, the number of users scales from 0 to 1000. As expected, the number of issued requests to the model increases in a similar fashion.

In the last part of the experiment (the last 10 minutes), the number of hits/requests and the mean response time stay steady. This suggests that this single machine is capable of dealing with the current payload.

Also, this single machine was able to process an overall average of 961.3 requests/second. In fact, after reaching the maximum number of simultaneous users (1000), this average was nearly 1200 requests/second.

To further test our hypothesis, we decided to add a scaling plan to our load tests. Here, when the number of parallel requests per minute reaches the 30k mark, we instruct the system to scale up the number of running instances. For all tests, the maximum number of instances was set to 10; however, in no case did SageMaker Auto Scaling use all the available resources.

For the test below, Amazon Auto Scaling only issued 1 more instance to help process the current payload. That is represented by the red line in the CPU utilization figure below.

Nevertheless, the addition of this new instance increased the throughput and reduced latency. This is noticeable after the 15:48 time mark.

To better exercise the Auto Scaling tool, we decided to lower the requests-per-minute threshold for scaling. Now, Auto Scaling is instructed to launch a new instance as soon as the throughput reaches 15k requests/minute. As a consequence, Auto Scaling used a total of 4 instances to match the scaling plan. It is also quite intuitive to see that as the number of instances grows, the CPU usage percentage decreases.

We noticed that at the beginning of all tests, we had a big spike in latency. Our experiments suggest this high average value is caused by the test harness itself (Taurus/JMeter) warming up and preparing resources. Note that after the spike, the response time quickly decreases to normal values and later increases along with the number of virtual users (as expected). Also, this initial spike is not seen in the latency statistics for the API Gateway or SageMaker — which supports our initial thoughts.

Also, specifically for this test and choice of model, Auto Scaling was not very effective, because the load we applied to the server could be handled completely by a single machine.

Conclusion

Here are a few of our observations about SageMaker:

  • It offers a very clean and easy-to-use interface. Jupyter notebooks offer many advantages, and the built-in algorithms are easy to use (scikit-learn-style API). Also, the machines used for training are only billed while training is happening. No payment for idle time :)
  • It takes away many of the boring tasks of ML. Auto scaling and automatic hyper-parameter tuning are excellent features.
  • If using the built-in algorithms, deployment is very straightforward. Just one line of code.
  • Though SageMaker supports third-party ML libraries, we found that serving a pre-trained model is not as straightforward as using its native built-in algorithms.
