10 Best Practices for Using AWS

Best Practice #1: Scalability

Simple, seamless infrastructure scaling is one of the main reasons to run workloads on the AWS cloud.

Anti-pattern 1: Running at full capacity. When problems occur (failures or capacity spikes), users cannot access applications because no additional resources are available.

Anti-pattern 2: Manual scaling. An operator notices that servers are running at full capacity and manually launches one or more new instances. Unfortunately, it always takes a few minutes for a newly launched instance to become available, and during that time users cannot access the application.

The best practice is to anticipate demand and deliver additional capacity before it is needed. Amazon CloudWatch can detect whether the total load across the server fleet has reached a specified threshold. The threshold could be "CPU utilization stays above 60% for more than 5 minutes," or any other metric related to resource usage. You can also define custom CloudWatch metrics for a specific application and trigger scaling from them. When the alarm fires, Amazon EC2 Auto Scaling launches a new instance right away. Because the threshold sits below full capacity, the instance is ready before capacity runs out, and users see no interruption.
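The threshold rule quoted above can be sketched in a few lines of plain Python. This is only an illustration of the evaluation logic, not the CloudWatch API: the function name, the one-sample-per-minute assumption, and the sample values are all hypothetical.

```python
from typing import Sequence


def alarm_should_fire(
    cpu_samples: Sequence[float],
    threshold: float = 60.0,
    periods: int = 5,
) -> bool:
    """Return True when the most recent `periods` samples all exceed `threshold`.

    Assumes one CPU utilization sample per minute (an assumption made for
    this sketch), so periods=5 models "above 60% for 5 minutes".
    """
    if len(cpu_samples) < periods:
        # Not enough data yet to decide.
        return False
    return all(sample > threshold for sample in cpu_samples[-periods:])


# Hypothetical samples: the last five readings are all above 60%.
samples = [42.0, 55.0, 61.5, 63.0, 70.2, 68.9, 75.4]
print(alarm_should_fire(samples))
```

A single dip below the threshold inside the window resets the condition, which is why a short spike does not trigger scaling in this sketch.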

Ensure scalability at every layer of the infrastructure so the architecture can respond to change at any time. Ideally, also design the system to scale in when demand drops, so you stop paying for instances you no longer need.
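One way to picture scaling in both directions is a proportional rule that sizes the fleet so the average metric moves back toward a target. The sketch below assumes such a rule for illustration; it is not the exact algorithm of any AWS service, and the function name and numbers are hypothetical.

```python
import math


def desired_capacity(
    current: int, metric: float, target: float, minimum: int, maximum: int
) -> int:
    """Size the fleet so the average metric lands near `target`.

    The result is clamped to [minimum, maximum] so the fleet never
    shrinks to zero or grows without bound.
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    raw = math.ceil(current * metric / target)
    return max(minimum, min(maximum, raw))


# Scale out: 4 instances at 90% CPU with a 60% target.
print(desired_capacity(4, 90.0, 60.0, minimum=2, maximum=12))
# Scale in: 4 instances at 15% CPU, limited by the minimum of 2.
print(desired_capacity(4, 15.0, 60.0, minimum=2, maximum=12))
```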

Best Practice #2: Automation

AWS provides built-in monitoring and automation tools at almost every layer of infrastructure. Use these tools to ensure infrastructure can quickly respond to changes, automatically detect unhealthy resources and launch replacement resources, and provide timely notifications when resource allocation changes.

Introduce one or more types of automation into the application stack, such as serverless management and deployment, infrastructure management and deployment, and alarms and events. Provisioning, terminating, and configuring resources automatically makes the system more resilient, scalable, and performant.

Best Practice #3: Use Disposable Resources

In a traditional IT infrastructure environment, new hardware is usually purchased in advance and resources are fixed. Operators log in to servers manually to install software, apply patches, edit configuration files, assign IP addresses, test, and run the workload. This is expensive and inflexible, and it makes upgrades harder.

For long-running servers, another problem is "configuration drift": over time, changes and software patches applied differently across environments can leave servers with widely differing configurations.
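Configuration drift can be made visible by fingerprinting each server's configuration and comparing the results. The following is a minimal sketch under the assumption that a configuration can be represented as a JSON-serializable dictionary; the server names and settings are hypothetical.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, List


def fingerprint(config: dict) -> str:
    """Hash a configuration so that key order does not matter."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_servers(configs: Dict[str, dict]) -> List[str]:
    """Return names of servers whose config differs from the most common one."""
    prints = {name: fingerprint(cfg) for name, cfg in configs.items()}
    if not prints:
        return []
    baseline, _count = Counter(prints.values()).most_common(1)[0]
    return sorted(name for name, fp in prints.items() if fp != baseline)


# Hypothetical fleet: web-3 was patched by hand and now differs.
fleet = {
    "web-1": {"nginx": "1.24", "tls": True},
    "web-2": {"tls": True, "nginx": "1.24"},
    "web-3": {"nginx": "1.18", "tls": True},
}
print(drifted_servers(fleet))
```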

When you design for AWS, you can take advantage of the dynamic provisioning of cloud computing and treat servers and other components as temporary resources. You can launch as many as you need and use them only for as long as you need them.

When a problem occurs or an update is needed, the affected server is replaced by a new server with the latest configuration. This keeps resources in a consistent state and makes rollbacks easier. A stateless architecture makes this approach easier to support.

Anti-pattern	Best Practice
Over time, different servers end up with different configurations	Automatically deploy new resources with the same configuration
Resources keep running when not needed	Terminate resources that are no longer used
Hardcoded IP addresses lack flexibility	Automatically switch to new IP addresses
Testing updates on running hardware is inconvenient	Test updates on new resources, then replace old resources with new ones

Take advantage of the dynamic provisioning of cloud computing. This best practice treats infrastructure as software rather than hardware. With hardware, it is easy to become over-invested in specific components, which makes upgrading harder when it becomes necessary. Treating resources as easily replaceable lets you respond quickly to changes in capacity demand, upgrade applications, and manage the underlying software.

Best Practice #4: Loose Coupling

Traditional infrastructure is built from tightly integrated groups of servers, each with a specific purpose. When one component or layer fails, the impact on the system can be catastrophic. Tight coupling also hinders scaling: if servers are added to or removed from one layer, every server on each connecting layer has to be reconnected as well.

Where possible, use loose coupling and let managed solutions act as intermediaries between system layers. The intermediary then handles failure and scaling of each layer automatically. The two main solutions for decoupling components are load balancers and message queues. The left diagram shows a set of tightly coupled web and application servers. The right diagram shows a load balancer routing requests between the web servers and the application servers.

On the right, if one application server goes down, the load balancer (Elastic Load Balancing) automatically directs all traffic to the two healthy servers. On the left, if one application server goes down, requests over the connections between the web servers and that server fail.
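The behavior described for the load-balanced case can be sketched as a round-robin router that skips unhealthy servers. This is a toy model, not Elastic Load Balancing itself; the class name, method names, and server names are hypothetical.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Route requests in turn to servers that are currently healthy."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        self._healthy: Set[str] = set(self._servers)
        self._next = 0

    def mark_down(self, server: str) -> None:
        self._healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self._servers:
            self._healthy.add(server)

    def route(self) -> str:
        # Try each server at most once per call, skipping unhealthy ones.
        for _ in range(len(self._servers)):
            candidate = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_down("app-2")  # simulate one application server failing
print([balancer.route() for _ in range(4)])
```

The web tier only ever talks to the balancer, so adding or removing application servers requires no change on the web servers.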

Best Practice #5: Services Not Servers

Developing, managing, and operating applications, especially large-scale applications, requires a variety of underlying technical components. With traditional IT infrastructure, a company has to build and run all of these components itself.

AWS provides a broad set of compute, storage, database, analytics, application, and deployment services to help organizations move faster and reduce IT costs.

An architecture that ignores this breadth (for example, one that uses only EC2) does not take full advantage of cloud computing and may miss opportunities to improve developer productivity and operational efficiency.

The best practice is to take advantage of the breadth of AWS services rather than relying on servers alone.

Anti-pattern	Best Practice
Simple applications run continuously on servers	Serverless solutions provisioned on demand
Applications communicate directly with each other	Applications communicate via message queues
Static web resources stored on local instances	Static web resources stored externally, such as on S3
Backend services handle user authorization and user state	AWS services manage user authorization and user state

Although EC2 offers great flexibility in how a solution is delivered, it should not be the default answer to every need. AWS serverless solutions and managed services can meet many requirements without provisioning, configuring, and managing EC2 instances. Services such as AWS Lambda, Amazon Simple Queue Service, Amazon DynamoDB, Elastic Load Balancing, Amazon Simple Email Service, and Amazon Cognito can often replace server-based solutions with managed ones that cost less, are simpler to configure, and perform better.
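As an illustration of the "services, not servers" idea, a small piece of application logic can be written as a function in the style of an AWS Lambda Python handler, which receives an `event` and a `context` argument. The event shape used here (a `queryStringParameters` dictionary) and the response fields are assumptions for this sketch, and the function is called locally rather than deployed.

```python
import json


def handler(event, context):
    """Greet the caller; no server process has to be kept running for this."""
    # The event layout is an assumption for this sketch.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local invocation with a hypothetical event; context is unused here.
response = handler({"queryStringParameters": {"name": "AWS"}}, None)
print(response["body"])
```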

Best Practice #6: Choose the Right Database

In a traditional data center, hardware and licensing can limit the choice of database. On AWS, these constraints are largely removed by managed database services, including ones based on open-source engines. AWS offers a range of data storage options, which gives you more flexibility to select the right database technology for each workload. The following questions can help you decide which solutions to include in the architecture:

Is this a read-heavy, write-heavy, or balanced workload? How many reads and writes per second are needed? How will these values change if user numbers increase?
How much data needs to be stored, and for how long? How fast will this grow? Will there be an upper limit in the near future? What is the size of each object (average, minimum, maximum)? How are these objects accessed?
What are the requirements for data durability?
What are the latency requirements? How many concurrent users need to be supported?
What is the data model? How will data be queried? Are queries essentially relational? Can flatter data structures be created that are easier to scale?
What functionality is needed? Is strong integrity control needed, or greater flexibility? Are complex reporting or search capabilities needed? Are developers more familiar with relational databases than NoSQL?
What are the database licensing costs? Do these costs account for application development investment, storage, and usage costs? Does the licensing model support expected growth? Can cloud-native database engines like Amazon Aurora be used to gain the simplicity and cost-effectiveness of open-source databases?

Best Practice #7: Avoid Single Points of Failure

Assume every point can fail and design recovery measures. Where possible, eliminate single points of failure from the architecture. However, this doesn't mean every component must be redundant at all times. Depending on downtime SLAs, automated solutions can be launched only when needed, or managed services can be used where AWS automatically replaces failed underlying hardware.

The simple system above shows two application servers connected to a single database server. The database server is a single point of failure. When it's unavailable or performance degrades, the application is similarly affected. Single points of failure need to be avoided; even if underlying physical hardware fails or is deleted/replaced, the application should continue running.

A common solution to the single database server problem is to create a standby server and replicate data. If the primary database server goes offline, the standby server can take over the load. Note that when the primary database goes offline, application servers need to automatically send their requests to the secondary database. This goes back to Best Practice #3: treat resources as disposable and design applications to support hardware changes.
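The requirement that application servers redirect their requests automatically can be sketched as a client that tries the primary first and falls back to the standby. This is a simplified illustration with hypothetical names; the two endpoint functions are stand-ins for real database connections.

```python
from typing import Callable, List, Sequence


class AllEndpointsFailed(Exception):
    """Raised when neither the primary nor any standby can serve the query."""


def query_with_failover(endpoints: Sequence[Callable[[str], str]], sql: str) -> str:
    """Try each endpoint in order and return the first successful result."""
    errors: List[Exception] = []
    for run_query in endpoints:
        try:
            return run_query(sql)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next one
    raise AllEndpointsFailed(f"{len(errors)} endpoint(s) failed")


def primary(sql: str) -> str:
    # Stand-in for a primary database that has gone offline.
    raise ConnectionError("primary is offline")


def standby(sql: str) -> str:
    # Stand-in for a healthy standby holding replicated data.
    return f"standby ran: {sql}"


print(query_with_failover([primary, standby], "SELECT 1"))
```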

Best Practice #8: Cost Optimization

Leverage AWS elasticity to enhance cost efficiency. Consider:

Does resource size match the load?
Which metrics need monitoring?
Are unused resources turned off?
How often are resources used?
Can managed services replace existing servers?

Another advantage of AWS is the ability to match costs to requirements by replacing capital expenditure (CAPEX) with operational expenditure (OPEX). The best way to structure infrastructure costs is to make sure you pay only for what you need.

Additionally, each AWS service typically has different pricing tiers and models, or different configurations within each service, which can be leveraged to optimize costs.
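A quick calculation shows why paying only for what is needed matters. The price (10 cents per instance-hour) and both schedules below are hypothetical numbers chosen for illustration, not AWS pricing.

```python
from typing import Iterable, Tuple


def daily_cost_cents(
    schedule: Iterable[Tuple[int, int]], price_cents_per_hour: int
) -> int:
    """Cost of one day, given (instance_count, hours) pairs.

    Integer cents are used to avoid floating-point rounding.
    """
    instance_hours = sum(count * hours for count, hours in schedule)
    return instance_hours * price_cents_per_hour


PRICE = 10  # hypothetical price in cents per instance-hour

# Fixed fleet sized for peak: 10 instances around the clock.
fixed = daily_cost_cents([(10, 24)], PRICE)
# Elastic fleet: 10 instances for 8 peak hours, 3 for the other 16 hours.
elastic = daily_cost_cents([(10, 8), (3, 16)], PRICE)

print(fixed, elastic, fixed - elastic)
```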

Best Practice #9: Use Caching Appropriately

Use caching to reduce redundant data retrieval operations.

Caching temporarily stores data or files in an intermediate location between the requester and permanent storage, so that future requests are served faster and less data crosses the network. For example, in the anti-pattern above:

1. The Amazon S3 bucket has no caching service in front of it

2. Three users separately request a file from the Amazon S3 bucket

3. The file is delivered to each user in the same way, and each request takes the same time and incurs the same cost

Compare this with a better pattern:

1. In the best practice, a cache is placed in front of Amazon S3

2. In this scenario, the first request checks for the file in CloudFront, does not find it, retrieves the file from Amazon S3, and stores a copy at the CloudFront edge location closest to the user

3. When other users request the file, it is served from the nearby CloudFront edge location rather than from Amazon S3. This reduces latency and cost: after the first request, as long as the copy stays cached, there is no need to pay to retrieve the file from Amazon S3 again
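The three steps above amount to a read-through cache. The sketch below models the pattern in memory only; `Origin` stands in for the S3 bucket and `EdgeCache` for a single edge location. Both class names are hypothetical, and expiry and eviction are deliberately left out.

```python
from typing import Dict


class Origin:
    """Stand-in for permanent storage; counts how often it is hit."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)
        self.requests = 0

    def get(self, key: str) -> bytes:
        self.requests += 1
        return self._objects[key]


class EdgeCache:
    """Read-through cache: fetch from the origin only on a miss."""

    def __init__(self, origin: Origin) -> None:
        self._origin = origin
        self._store: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        if key not in self._store:
            # Cache miss: retrieve once from the origin and keep a copy.
            self._store[key] = self._origin.get(key)
        return self._store[key]


origin = Origin({"logo.png": b"image-bytes"})
edge = EdgeCache(origin)
for _user in range(3):
    edge.get("logo.png")

print(origin.requests)
```

Three users request the file, but the origin is contacted only for the first request.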

Best Practice #10: Security

Security comes first. Make sure protection is built into every layer of the system. Consider:

Isolate various parts of the infrastructure
Leverage managed services
Encrypt data in transit and at rest
Log access
Strictly enforce access control, using the principle of least privilege
Automate deployment to maintain security consistency
Use multi-factor authentication

Security is not just about protecting the external boundaries of infrastructure, but also ensuring individual components are secure from each other. For example, in Amazon EC2, you can set security groups to determine which ports on instances can send and receive traffic, and where that traffic comes from or goes to.

This feature can be used to reduce the likelihood of security threats on a single instance spreading to other instances in the environment. Similar precautions should be taken for other services.
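The effect of security group rules can be pictured as a default-deny check against a list of allow rules. The sketch below is a simplified model that matches only a single port and a source CIDR range; real security groups also cover protocols and port ranges. The rule set and addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IngressRule:
    port: int
    cidr: str  # source range allowed to reach this port


def is_allowed(rules: Iterable[IngressRule], source_ip: str, port: int) -> bool:
    """Allow traffic only if some rule matches; everything else is denied."""
    address = ipaddress.ip_address(source_ip)
    return any(
        rule.port == port and address in ipaddress.ip_network(rule.cidr)
        for rule in rules
    )


# Hypothetical rules: HTTPS from anywhere, PostgreSQL only from one subnet.
rules = [
    IngressRule(port=443, cidr="0.0.0.0/0"),
    IngressRule(port=5432, cidr="10.0.1.0/24"),
]

print(is_allowed(rules, "203.0.113.7", 443))
print(is_allowed(rules, "203.0.113.7", 5432))
print(is_allowed(rules, "10.0.1.25", 5432))
```

Restricting the database port to the application subnet is what limits how far a compromise of one internet-facing instance can spread.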
