Best Practice #1: Scalability
Simple, seamless infrastructure scaling is one of the main reasons for running workloads on the AWS Cloud.
Anti-pattern 1: Running at full capacity. When a failure or a demand spike occurs, users cannot reach the application because no spare resources are available.
Anti-pattern 2: Manual scaling. When operators notice that servers are running at full capacity, they manually launch one or more new instances. Unfortunately, there is always a delay of a few minutes between launching an instance and it becoming available, and during that time users cannot access the application.
The best practice is to anticipate demand and add capacity before it is needed. Amazon CloudWatch can detect when the total load across the server fleet reaches a specified threshold. The threshold could be "CPU utilization stays above 60% for more than 5 minutes," or any other metric related to resource usage. With CloudWatch you can also define custom metrics specific to your application and trigger scaling from them. When an alarm fires, Amazon EC2 Auto Scaling launches a new instance right away, so that it is ready before capacity runs out and users see no interruption.
Ensure scalability at every layer of the infrastructure so that the architecture can respond to changes in demand at any time. Ideally, systems should also scale in when demand drops, so that you do not pay for instances that are no longer needed.
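The threshold rule and the scale-in idea can be sketched as a small decision function. This is a local simulation only, not the CloudWatch or EC2 Auto Scaling API. The names `ScalingPolicy` and `desired_capacity`, the 30% scale-in threshold, the instance limits, and the use of five one-minute datapoints to approximate "more than 5 minutes" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScalingPolicy:
    # All values are illustrative assumptions, not AWS defaults.
    scale_out_threshold: float = 60.0  # percent CPU, as in the example threshold
    scale_in_threshold: float = 30.0   # hypothetical lower bound for scale-in
    evaluation_periods: int = 5        # consecutive one-minute datapoints
    min_instances: int = 2
    max_instances: int = 10


def desired_capacity(cpu_samples: Sequence[float], current: int,
                     policy: ScalingPolicy) -> int:
    """Return the instance count after evaluating the most recent datapoints."""
    window = list(cpu_samples)[-policy.evaluation_periods:]
    if len(window) < policy.evaluation_periods:
        return current  # not enough data to make a decision yet
    if all(sample > policy.scale_out_threshold for sample in window):
        # Sustained high load: add one instance, capped at the maximum.
        return min(current + 1, policy.max_instances)
    if all(sample < policy.scale_in_threshold for sample in window):
        # Sustained low load: remove one instance, never below the minimum.
        return max(current - 1, policy.min_instances)
    return current
```

Requiring every datapoint in the window to cross the threshold keeps a single short spike from triggering a launch, which mirrors the "stays above 60%" wording.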
Best Practice #2: Automation
AWS provides built-in monitoring and automation tools at almost every layer of infrastructure. Use these tools to ensure infrastructure can quickly respond to changes, automatically detect unhealthy resources and launch replacement resources, and provide timely notifications when resource allocation changes.
Introduce one or more types of automation into the system, such as serverless management and deployment, infrastructure management and deployment, and alarms and events. Automatically provisioning, terminating, and configuring resources improves resilience, scalability, and performance.
Best Practice #3: Use Disposable Resources
In traditional IT environments, hardware is usually purchased in advance and treated as a fixed resource. Administrators log into servers manually to install software, apply patches, edit configuration files, assign IP addresses, and test and run workloads. This is expensive and inflexible, and it makes upgrades more difficult.
For long-running servers, another problem is "configuration drift." Over time, applying changes and software patches in different environments may lead to very different configurations.
On AWS, you can take advantage of the dynamic provisioning that cloud computing offers and treat servers and other components as temporary resources. You can launch as many as you need and keep them only for as long as they are needed.
When problems occur or updates are needed, problematic servers are replaced with new servers that carry the latest configuration. This keeps resources in a consistent state and makes rollbacks easier. Stateless architectures support this approach more easily.
| Anti-pattern | Best Practice |
| --- | --- |
| Over time, different servers end up with different configurations | Automatically deploy new resources with the same configuration |
| Resources keep running even when not needed | Terminate resources that are no longer used |
| Hardcoded IP addresses lack flexibility | Automatically switch to the new IP address |
| Testing updates on running hardware is inconvenient | Test updates on new resources, then replace old resources with new ones |
Leverage the dynamic provisioning characteristics of cloud computing. This best practice treats infrastructure as software rather than hardware. With hardware, it is easy to become over-invested in specific components, which makes upgrades harder when they become necessary. Treating resources as easily replaceable lets you respond quickly to changes in capacity demand, upgrade applications, and manage the underlying software.
Best Practice #4: Loose Coupling
Traditional infrastructure is built around tightly integrated groups of servers, each with a specific purpose. When one component or layer fails, the effect on the whole system can be fatal. Tight coupling also hinders scaling: if servers are added or removed in one layer, every server in each connected layer must be reconnected accordingly.
Where possible, use loose coupling, with managed solutions acting as intermediaries between system layers. Failure and scaling of each layer are then handled automatically by the intermediary. The two main solutions for decoupling components are load balancers and message queues. The left diagram shows a set of tightly coupled web and application servers. The right diagram shows a load balancer routing requests between web servers and application servers.
On the right, if one application server goes down, Elastic Load Balancing automatically directs all traffic to the two healthy servers. On the left, if one application server goes down, every connection between the web servers and that server produces errors.
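The behavior of the load balancer in this comparison can be illustrated with a minimal round-robin router that skips unhealthy targets. This is a toy model written for this article, not the Elastic Load Balancing API. The `LoadBalancer` class, its `mark` and `route` methods, and the target names are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class LoadBalancer:
    """Toy round-robin balancer that only routes to healthy targets."""

    def __init__(self, targets: List[str]) -> None:
        self._targets = list(targets)
        self._healthy: Dict[str, bool] = {t: True for t in self._targets}
        self._rotation: Iterator[str] = cycle(self._targets)

    def mark(self, target: str, healthy: bool) -> None:
        # In a real system a health check would call this, not the caller.
        self._healthy[target] = healthy

    def route(self) -> str:
        # Try each target at most once per request, in rotation order.
        for _ in range(len(self._targets)):
            candidate = next(self._rotation)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy targets available")
```

Because callers only ever talk to the balancer, marking `app-2` unhealthy changes where requests go without any change on the web-server side, which is the point of the loosely coupled design.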
Best Practice #5: Services Not Servers
Developing, managing, and operating applications, especially large-scale applications, requires a variety of underlying technical components. Using traditional IT infrastructure, companies would have to build and run all these components.
AWS provides a broad set of compute, storage, database, analytics, application, and deployment services to help organizations move faster and reduce IT costs.
Not taking advantage of this breadth (for example, using only EC2) means not making full use of cloud computing, and may mean missing opportunities to improve developer productivity and operational efficiency.
The best practice is to make full use of the breadth of AWS services rather than relying on servers alone.
| Anti-pattern | Best Practice |
| --- | --- |
| Simple applications continuously run on servers | Serverless solutions run on demand |
| Applications communicate directly with each other | Applications communicate via message queues |
| Static web resources stored on local instances | Static web resources stored externally, such as on S3 |
| Backend services handle user authorization and user state | AWS services manage user authorization and user state |
Although EC2 provides great flexibility in how solutions are delivered, it should not be the first choice for every need. AWS serverless solutions and managed services can meet many needs without provisioning, configuring, and managing EC2 instances. Services such as AWS Lambda, Amazon Simple Queue Service, Amazon DynamoDB, Elastic Load Balancing, Amazon Simple Email Service, and Amazon Cognito can replace server-based solutions at lower cost, with simpler configuration and better performance.
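As a sketch of the "serverless on demand" row in the table, a simple request handler can be written as an AWS Lambda function instead of a process on an always-on server. The example assumes the function is invoked through an API Gateway proxy integration, which is why it reads `queryStringParameters` and returns `statusCode`, `headers`, and `body`. The greeting logic and the `name` parameter are made up for illustration.

```python
import json
from typing import Any, Dict


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Handle one request; the code runs only while an invocation is in progress."""
    # queryStringParameters can be missing or null when no query string is sent.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

The handler is an ordinary function, so it can be exercised locally by passing a dictionary as `event` and `None` as `context`.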
Best Practice #6: Choose the Right Database
In traditional data centers, hardware and licenses can limit the choice of database solutions. On AWS, managed database services, including open-source engines, remove many of these constraints. AWS offers a range of data storage options, which gives you more flexibility to select the right database technology for each workload. The following questions can help decide which solutions to include in the architecture:
- Is this a read-heavy, write-heavy, or balanced workload? How many reads and writes per second are needed? How will these values change if user numbers increase?
- How much data needs to be stored, and for how long? How fast will this grow? Will there be an upper limit in the near future? What is the size of each object (average, minimum, maximum)? How are these objects accessed?
- What are the requirements for data durability?
- What are the latency requirements? How many concurrent users need to be supported?
- What is the data model? How will data be queried? Are queries essentially relational? Can flatter data structures be created that are easier to scale?
- What functionality is needed? Is strong integrity control needed, or greater flexibility? Are complex reporting or search capabilities needed? Are developers more familiar with relational databases than NoSQL?
- What are the database licensing costs? Do these costs account for application development investment, storage, and usage costs? Does the licensing model support expected growth? Can cloud-native database engines like Amazon Aurora be used to gain the simplicity and cost-effectiveness of open-source databases?
Best Practice #7: Avoid Single Points of Failure
Assume every point can fail and design recovery measures. Where possible, eliminate single points of failure from the architecture. However, this doesn't mean every component must be redundant at all times. Depending on downtime SLAs, automated solutions can be launched only when needed, or managed services can be used where AWS automatically replaces failed underlying hardware.
The simple system above shows two application servers connected to a single database server. The database server is a single point of failure: when it is unavailable or its performance degrades, the application is affected in the same way. Single points of failure need to be avoided. Even if the underlying physical hardware fails or is removed or replaced, the application should continue running.
A common solution to the single database server problem is to create a standby server and replicate data. If the primary database server goes offline, the standby server can take over the load. Note that when the primary database goes offline, application servers need to automatically send their requests to the secondary database. This goes back to Best Practice #3: treat resources as disposable and design applications to support hardware changes.
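The requirement that application servers automatically redirect requests to the secondary database can be sketched as a client that tries endpoints in order. This is a minimal illustration under stated assumptions: `FailoverClient`, `DatabaseUnavailable`, and the `send` callable are hypothetical names, and the sketch ignores replication lag and standby promotion, which a real deployment or a managed service would have to handle.

```python
from typing import Callable, List


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) transport when an endpoint cannot be reached."""


class FailoverClient:
    """Send queries to the active endpoint and fall over to the next on failure."""

    def __init__(self, endpoints: List[str],
                 send: Callable[[str, str], str]) -> None:
        self._endpoints = list(endpoints)  # primary first, then standbys
        self._send = send                  # send(endpoint, statement) -> result
        self._active = 0

    def query(self, statement: str) -> str:
        # Start with the endpoint that last worked, then try the others once.
        for offset in range(len(self._endpoints)):
            index = (self._active + offset) % len(self._endpoints)
            try:
                result = self._send(self._endpoints[index], statement)
            except DatabaseUnavailable:
                continue
            self._active = index  # keep using the endpoint that answered
            return result
        raise DatabaseUnavailable("all database endpoints failed")
```

Keeping the failover logic in one client means the rest of the application never refers to a specific database server, which fits the disposable-resources idea from Best Practice #3.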
Best Practice #8: Cost Optimization
Leverage AWS elasticity to improve cost efficiency. Consider:
- Does resource size match the load?
- Which metrics need monitoring?
- Are unused resources turned off?
- How often are resources actually used?
- Can managed services replace existing servers?
Another advantage of AWS is the ability to match costs to requirements, replacing capital expenditure (CAPEX) with operational expenditure (OPEX). The best way to structure infrastructure costs is to make sure you pay only for what you need.
Additionally, AWS services typically offer different pricing tiers and models, or different configurations within a service, which can be used to optimize costs.
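The effect of paying only for what is needed can be shown with simple arithmetic. The hourly rate below is a made-up figure for illustration, not an AWS price, and `monthly_cost` is a hypothetical helper that assumes a 30-day month.

```python
def monthly_cost(hourly_rate: float, hours_per_day: float,
                 days: int = 30, instances: int = 1) -> float:
    """Return the cost of running `instances` for `hours_per_day` over `days`."""
    return round(hourly_rate * hours_per_day * days * instances, 2)


# Hypothetical rate in USD per instance-hour; real prices vary by service,
# instance type, region, and pricing model.
HYPOTHETICAL_RATE = 0.10

always_on = monthly_cost(HYPOTHETICAL_RATE, hours_per_day=24)
business_hours = monthly_cost(HYPOTHETICAL_RATE, hours_per_day=10)
savings = always_on - business_hours

print(f"always on: {always_on}, 10 h/day: {business_hours}, saved: {savings}")
```

Under these assumed numbers, turning the instance off outside a 10-hour working day cuts the bill by more than half, which is why the question about unused resources matters.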
Best Practice #9: Use Caching Appropriately
Use caching to reduce redundant data retrieval operations.
Caching means temporarily storing data or files in an intermediate location between the requester and permanent storage, so that future requests are served faster and less data has to travel over the network. For example, in the anti-pattern above:
1. The Amazon S3 bucket has no caching service in front of it
2. Three users each request a file from the Amazon S3 bucket
3. The file is delivered to each user in the same way, and each request takes the same time and incurs the same cost
Compare this with a better pattern:
1. In the best practice, Amazon CloudFront is placed in front of Amazon S3 as a cache
2. The first request checks for the file in CloudFront, does not find it, retrieves the file from Amazon S3, and stores a copy at the CloudFront edge location closest to the user
3. When other users request the file, it is served from the nearby CloudFront edge location rather than from Amazon S3. This reduces latency and cost, because after the first request the file no longer has to be transferred from Amazon S3
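The three steps of the better pattern can be sketched as a read-through cache. This is a local model of the idea, not the CloudFront API. `EdgeCache` and the `fetch_from_origin` callable are hypothetical, and the sketch omits expiry, so cached objects never go stale or get evicted.

```python
from typing import Callable, Dict


class EdgeCache:
    """Read-through cache: serve from the local copy, fetch from origin on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin   # stands in for a read from Amazon S3
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._store:
            self.hits += 1                # served from the edge copy
            return self._store[key]
        self.misses += 1                  # first request: go to the origin
        data = self._fetch(key)
        self._store[key] = data           # keep a copy for later requests
        return data
```

With three users requesting the same file, the origin is contacted once and the remaining two requests are cache hits, which is where the latency and cost savings come from.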
Best Practice #10: Security
Security is the top priority. Make sure protection is implemented at every layer of the system. Consider:
- Isolate various parts of the infrastructure
- Leverage managed services
- Encrypt data in transit and at rest
- Log access records
- Strictly enforce access control, using the principle of least privilege
- Automate deployment to maintain security consistency
- Use multi-factor authentication
Security is not just about protecting the outer boundary of the infrastructure; it also means protecting individual components from one another. For example, in Amazon EC2 you can configure security groups to determine which ports on an instance can send and receive traffic, and where that traffic may come from or go to.
This can reduce the likelihood that a security threat on a single instance spreads to other instances in the environment. Similar precautions should be taken for other services.
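The way security group rules limit traffic by port and source can be modeled as a simple allow-list check. This is a simplified local model, not the EC2 API: `InboundRule` and `is_allowed` are hypothetical names, only inbound rules are covered, and protocols and references to other security groups are left out. It does reflect that security groups contain allow rules only, so anything not matched is denied.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class InboundRule:
    from_port: int     # first port in the allowed range
    to_port: int       # last port in the allowed range (inclusive)
    source_cidr: str   # network allowed to connect, e.g. "10.0.0.0/16"


def is_allowed(rules: Iterable[InboundRule], source_ip: str, port: int) -> bool:
    """Return True if any rule allows traffic from source_ip to port."""
    address = ipaddress.ip_address(source_ip)
    return any(
        rule.from_port <= port <= rule.to_port
        and address in ipaddress.ip_network(rule.source_cidr)
        for rule in rules
    )


# Example: HTTPS open to everyone, SSH only from an internal network.
web_rules = [
    InboundRule(443, 443, "0.0.0.0/0"),
    InboundRule(22, 22, "10.0.0.0/16"),
]
```

With these example rules, an address outside `10.0.0.0/16` can reach port 443 but not port 22, so a compromised external host has no SSH path to the instance.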